Invoice Processing Guide

This guide walks you through processing invoices with the DocDigitizer API. Learn how to extract invoice data, handle different invoice formats, and integrate invoice processing into your applications.

Overview

DocDigitizer’s invoice processing extracts structured data from commercial invoices, including header information, line items, totals, and payment details.

Supported Invoice Types

  • Commercial invoices
  • Tax invoices
  • Pro forma invoices
  • Credit notes
  • Debit notes
  • Self-billing invoices

Supported Countries

Invoice extraction supports country-specific formats for:

Region     Countries
Europe     Portugal (PT), Spain (ES), France (FR), Germany (DE), Italy (IT), UK (GB), Netherlands (NL), Belgium (BE)
Americas   United States (US), Brazil (BR), Mexico (MX), Canada (CA)
Other      Generic format for unlisted countries

Invoice Fields

The following fields are extracted from invoices. Field availability depends on the invoice content.

Header Fields

Field                 Type            Description
invoiceNumber         String          Invoice number or reference
invoiceDate           String (Date)   Invoice issue date (YYYY-MM-DD)
dueDate               String (Date)   Payment due date
purchaseOrderNumber   String          Related PO number
currency              String          Currency code (EUR, USD, GBP, etc.)
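
Because invoiceDate is documented as YYYY-MM-DD, the string can be parsed with the standard library before it is used elsewhere. The sketch below assumes dueDate uses the same format; the example response later in this guide suggests so, but the field table does not state it. The helper names parse_invoice_date and payment_window_days are hypothetical and not part of the API.

```python
from datetime import date

def parse_invoice_date(value):
    """Return a date for a YYYY-MM-DD string, or None if missing or malformed."""
    if not value:
        return None
    try:
        return date.fromisoformat(value)
    except (TypeError, ValueError):
        return None

def payment_window_days(extraction):
    """Days between invoiceDate and dueDate, or None if either is unusable."""
    issued = parse_invoice_date(extraction.get("invoiceDate"))
    due = parse_invoice_date(extraction.get("dueDate"))
    if issued is None or due is None:
        return None
    return (due - issued).days

sample = {"invoiceDate": "2024-01-15", "dueDate": "2024-02-14"}
print(payment_window_days(sample))  # 30
```

Returning None instead of raising keeps the caller in control of how to treat invoices with missing or unreadable dates.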

Vendor (Seller) Fields

Field           Type     Description
vendorName      String   Vendor/supplier company name
vendorTaxId     String   Vendor tax ID / VAT number
vendorAddress   String   Full vendor address
vendorEmail     String   Vendor email address
vendorPhone     String   Vendor phone number

Customer (Buyer) Fields

Field             Type     Description
customerName      String   Customer/buyer company name
customerTaxId     String   Customer tax ID / VAT number
customerAddress   String   Full customer address
shippingAddress   String   Shipping/delivery address (if different)

Financial Fields

Field         Type     Description
subtotal      Number   Total before tax
taxRate       Number   Tax rate percentage
taxAmount     Number   Total tax amount
discount      Number   Discount amount
totalAmount   Number   Grand total including tax
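
The financial fields are typed as Number, so they become floats once the JSON response is parsed. For arithmetic on money, one option is to convert them to Decimal first. This is a minimal sketch, assuming two decimal places suit the invoice currency (which does not hold for every currency); to_money is a hypothetical helper, not part of the API.

```python
from decimal import Decimal, ROUND_HALF_UP

def to_money(value, places="0.01"):
    """Convert a JSON number to a Decimal rounded to the given precision.

    Returns None when the field is missing, since field availability
    depends on the invoice content.
    """
    if value is None:
        return None
    # Going through str() avoids carrying over binary float noise.
    return Decimal(str(value)).quantize(Decimal(places), rounding=ROUND_HALF_UP)

extraction = {"subtotal": 2500.00, "taxAmount": 575.00, "totalAmount": 3075.00}
subtotal = to_money(extraction.get("subtotal"))
tax = to_money(extraction.get("taxAmount"))
total = to_money(extraction.get("totalAmount"))
print(subtotal + tax == total)  # True
```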

Payment Fields

Field           Type     Description
paymentTerms    String   Payment terms (Net 30, etc.)
paymentMethod   String   Payment method
bankName        String   Bank name
bankAccount     String   Bank account number / IBAN
swiftCode       String   SWIFT/BIC code

Line Item Fields

Field                     Type     Description
lineItems                 Array    Array of line item objects
lineItems[].description   String   Item description
lineItems[].quantity      Number   Quantity
lineItems[].unit          String   Unit of measure
lineItems[].unitPrice     Number   Price per unit
lineItems[].amount        Number   Line total
lineItems[].taxRate       Number   Tax rate for this item
lineItems[].productCode   String   Product/SKU code

Processing Invoices

Basic Invoice Processing

cURL

curl -X POST https://apix.docdigitizer.com/sync \
  -H "x-api-key: YOUR_API_KEY" \
  -F "files=@invoice.pdf" \
  -F "id=$(uuidgen)" \
  -F "contextId=YOUR_CONTEXT_ID"
    

PowerShell

# Process an invoice (the -Form parameter requires PowerShell 6.1 or later)
$headers = @{ "x-api-key" = "YOUR_API_KEY" }

$form = @{
    files = Get-Item -Path "invoice.pdf"
    id = [guid]::NewGuid().ToString()
    contextId = "YOUR_CONTEXT_ID"
}

$response = Invoke-RestMethod -Uri "https://apix.docdigitizer.com/sync" `
    -Method Post -Headers $headers -Form $form

# Check result
if ($response.StateText -eq "COMPLETED") {
    $invoice = $response.Output | Where-Object { $_.docType -eq "Invoice" } | Select-Object -First 1
    Write-Host "Invoice Number: $($invoice.extraction.invoiceNumber)"
    Write-Host "Total: $($invoice.extraction.totalAmount) $($invoice.extraction.currency)"
}
    

Python

import requests
import uuid

def process_invoice(pdf_path, api_key, context_id):
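    # Open the file in a context manager so the handle is closed after upload
    with open(pdf_path, "rb") as f:
        response = requests.post(
            "https://apix.docdigitizer.com/sync",
            headers={"x-api-key": api_key},
            files={"files": f},
            data={
                "id": str(uuid.uuid4()),
                "contextId": context_id
            },
            timeout=300
        )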

    result = response.json()

    # Use .get() so a response without these keys returns None instead of raising
    if result.get("StateText") == "COMPLETED":
        # Find the invoice in the output
        for doc in result.get("Output", []):
            if doc.get("docType") == "Invoice":
                return doc.get("extraction")

    return None

# Usage
invoice_data = process_invoice("invoice.pdf", "YOUR_API_KEY", "YOUR_CONTEXT_ID")
if invoice_data:
    print(f"Invoice: {invoice_data.get('invoiceNumber')}")
    print(f"Total: {invoice_data.get('totalAmount')} {invoice_data.get('currency')}")
    

Working with Results

Example Response

{
    "StateText": "COMPLETED",
    "TraceId": "INV4567",
    "NumberPages": 2,
    "Output": [
        {
            "docType": "Invoice",
            "country": "PT",
            "pages": [1, 2],
            "schema": "Invoice_PT.json",
            "extraction": {
                "invoiceNumber": "FT 2024/00156",
                "invoiceDate": "2024-01-15",
                "dueDate": "2024-02-14",
                "vendorName": "Tech Solutions, Lda",
                "vendorTaxId": "PT509876543",
                "vendorAddress": "Av. da Liberdade, 100, 1250-096 Lisboa",
                "customerName": "Global Corp, SA",
                "customerTaxId": "PT501234567",
                "customerAddress": "Rua do Ouro, 50, 1100-063 Lisboa",
                "subtotal": 2500.00,
                "taxRate": 23,
                "taxAmount": 575.00,
                "totalAmount": 3075.00,
                "currency": "EUR",
                "paymentTerms": "30 dias",
                "bankAccount": "PT50 0035 0000 12345678901 94",
                "lineItems": [
                    {
                        "description": "Software Development Services",
                        "quantity": 40,
                        "unit": "hours",
                        "unitPrice": 50.00,
                        "amount": 2000.00,
                        "taxRate": 23
                    },
                    {
                        "description": "Cloud Hosting - January",
                        "quantity": 1,
                        "unit": "month",
                        "unitPrice": 500.00,
                        "amount": 500.00,
                        "taxRate": 23
                    }
                ]
            }
        }
    ]
}
    

Accessing Invoice Data

PowerShell

# Assuming $response contains the API response and the first document in Output is the invoice

$invoice = $response.Output[0].extraction

# Header information
$invoiceNumber = $invoice.invoiceNumber
$invoiceDate = $invoice.invoiceDate
$dueDate = $invoice.dueDate

# Vendor information
$vendor = @{
    Name = $invoice.vendorName
    TaxId = $invoice.vendorTaxId
    Address = $invoice.vendorAddress
}

# Financial summary
$totals = @{
    Subtotal = $invoice.subtotal
    Tax = $invoice.taxAmount
    Total = $invoice.totalAmount
    Currency = $invoice.currency
}

# Display summary
Write-Host "=== Invoice $invoiceNumber ==="
Write-Host "Date: $invoiceDate | Due: $dueDate"
Write-Host "Vendor: $($vendor.Name) ($($vendor.TaxId))"
Write-Host "Total: $($totals.Total) $($totals.Currency)"
    

Python

def display_invoice(extraction):
    """Display invoice data in a readable format."""

    print(f"{'='*50}")
    print(f"INVOICE: {extraction.get('invoiceNumber', 'N/A')}")
    print(f"{'='*50}")

    print(f"\nDate: {extraction.get('invoiceDate', 'N/A')}")
    print(f"Due:  {extraction.get('dueDate', 'N/A')}")

    print(f"\n--- VENDOR ---")
    print(f"Name:    {extraction.get('vendorName', 'N/A')}")
    print(f"Tax ID:  {extraction.get('vendorTaxId', 'N/A')}")
    print(f"Address: {extraction.get('vendorAddress', 'N/A')}")

    print(f"\n--- CUSTOMER ---")
    print(f"Name:    {extraction.get('customerName', 'N/A')}")
    print(f"Tax ID:  {extraction.get('customerTaxId', 'N/A')}")

    print(f"\n--- TOTALS ---")
    currency = extraction.get('currency', 'EUR')
    print(f"Subtotal: {extraction.get('subtotal', 0):.2f} {currency}")
    print(f"Tax:      {extraction.get('taxAmount', 0):.2f} {currency}")
    print(f"TOTAL:    {extraction.get('totalAmount', 0):.2f} {currency}")
    

Handling Line Items

Line items are returned as an array. Here’s how to process them:

PowerShell

# Process line items
$lineItems = $invoice.lineItems

Write-Host "`n=== Line Items ==="
Write-Host ("-" * 80)
Write-Host ("{0,-40} {1,10} {2,12} {3,12}" -f "Description", "Qty", "Unit Price", "Amount")
Write-Host ("-" * 80)

$total = 0
foreach ($item in $lineItems) {
    Write-Host ("{0,-40} {1,10} {2,12:N2} {3,12:N2}" -f `
        $item.description.Substring(0, [Math]::Min(40, $item.description.Length)),
        $item.quantity,
        $item.unitPrice,
        $item.amount)
    $total += $item.amount
}

Write-Host ("-" * 80)
Write-Host ("{0,-40} {1,10} {2,12} {3,12:N2}" -f "TOTAL", "", "", $total)
    

Python

def process_line_items(extraction):
    """Process and display line items."""

    line_items = extraction.get('lineItems', [])

    if not line_items:
        print("No line items found")
        return []  # Same return type as the non-empty case

    print(f"\n{'Description':<40} {'Qty':>8} {'Price':>12} {'Amount':>12}")
    print("-" * 75)

    total = 0
    for item in line_items:
        desc = item.get('description', 'N/A')[:40]
        qty = item.get('quantity', 0)
        price = item.get('unitPrice', 0)
        amount = item.get('amount', 0)

        print(f"{desc:<40} {qty:>8} {price:>12.2f} {amount:>12.2f}")
        total += amount

    print("-" * 75)
    print(f"{'TOTAL':<40} {'':>8} {'':>12} {total:>12.2f}")

    return line_items
    

Validating Line Items

def validate_line_items(extraction):
    """Validate that line items sum to subtotal."""

    line_items = extraction.get('lineItems', [])
    subtotal = extraction.get('subtotal', 0)

    calculated_total = sum(item.get('amount', 0) for item in line_items)

    # Allow small rounding differences
    tolerance = 0.01
    if abs(calculated_total - subtotal) > tolerance:
        print(f"Warning: Line items ({calculated_total:.2f}) don't match subtotal ({subtotal:.2f})")
        return False

    return True
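
The check above compares the sum of line amounts with the subtotal. A complementary check looks at each line on its own. The sketch below assumes a line's amount equals quantity times unitPrice with no per-line discount, which holds for the example response in this guide but may not hold for every invoice. find_inconsistent_lines is a hypothetical helper name.

```python
def find_inconsistent_lines(extraction, tolerance=0.01):
    """Return (index, expected, actual) for lines where quantity * unitPrice
    differs from amount by more than the tolerance."""
    problems = []
    for index, item in enumerate(extraction.get("lineItems", [])):
        quantity = item.get("quantity")
        unit_price = item.get("unitPrice")
        amount = item.get("amount")
        # Skip lines where any of the three values was not extracted.
        if quantity is None or unit_price is None or amount is None:
            continue
        expected = round(quantity * unit_price, 2)
        if abs(expected - amount) > tolerance:
            problems.append((index, expected, amount))
    return problems

# Usage with the two line items from the example response
sample = {
    "lineItems": [
        {"quantity": 40, "unitPrice": 50.00, "amount": 2000.00},
        {"quantity": 1, "unitPrice": 500.00, "amount": 500.00},
    ]
}
print(find_inconsistent_lines(sample))  # []
```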
    

Multi-Page Invoices

DocDigitizer handles multi-page invoices automatically, combining data from all pages into a single extraction result.

How It Works

  • All pages of an invoice are analyzed together
  • Header information is typically taken from page 1
  • Line items are collected from all pages
  • Totals are extracted from the summary section (often on the last page)

Identifying Page Numbers

The pages array in the response shows which pages contain the invoice. In this example, the invoice spans pages 1-3:

{
    "docType": "Invoice",
    "pages": [1, 2, 3],
    "extraction": { ... }
}
    

Example: Processing Multi-Page Invoice

# Check if invoice spans multiple pages
$invoice = $response.Output[0]
$pageCount = $invoice.pages.Count

if ($pageCount -gt 1) {
    Write-Host "Multi-page invoice detected: Pages $($invoice.pages -join ', ')"
}

# Line items from all pages are combined
$totalLineItems = $invoice.extraction.lineItems.Count
Write-Host "Total line items extracted: $totalLineItems"
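
For Python users, the same check might look like the sketch below. It works on one entry of the Output array and relies only on the pages and extraction.lineItems keys shown in the example response; summarize_pages is a hypothetical helper name, and the sample document is made up for illustration.

```python
def summarize_pages(document):
    """Summarize page span and line item count for one entry of Output."""
    pages = document.get("pages", [])
    line_items = document.get("extraction", {}).get("lineItems", [])
    return {
        "page_count": len(pages),
        "is_multi_page": len(pages) > 1,
        "line_item_count": len(line_items),
    }

# Illustrative document, not real API output
document = {
    "docType": "Invoice",
    "pages": [1, 2, 3],
    "extraction": {"lineItems": [{"description": "Item A"}, {"description": "Item B"}]},
}

summary = summarize_pages(document)
if summary["is_multi_page"]:
    page_list = ", ".join(str(p) for p in document["pages"])
    print(f"Multi-page invoice detected: pages {page_list}")
print(f"Total line items extracted: {summary['line_item_count']}")
```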
    

Batch Processing

To process many invoices, loop over the files in a folder, submit each one to the /sync endpoint, and collect the results in a CSV file.

PowerShell Batch Processing

function Process-InvoiceBatch {
    param(
        [string]$FolderPath,
        [string]$ApiKey,
        [string]$ContextId,
        [string]$OutputCsv = "results.csv"
    )

    $pdfFiles = Get-ChildItem -Path $FolderPath -Filter "*.pdf"
    $results = @()

    foreach ($pdf in $pdfFiles) {
        Write-Host "Processing: $($pdf.Name)..."

        try {
            $form = @{
                files = Get-Item -Path $pdf.FullName
                id = [guid]::NewGuid().ToString()
                contextId = $ContextId
            }

            $response = Invoke-RestMethod -Uri "https://apix.docdigitizer.com/sync" `
                -Method Post `
                -Headers @{ "x-api-key" = $ApiKey } `
                -Form $form

            if ($response.StateText -eq "COMPLETED") {
                # Take only the first invoice so property access returns single values
                $invoice = $response.Output | Where-Object { $_.docType -eq "Invoice" } | Select-Object -First 1

                if ($invoice) {
                    $results += [PSCustomObject]@{
                        FileName = $pdf.Name
                        Status = "Success"
                        InvoiceNumber = $invoice.extraction.invoiceNumber
                        VendorName = $invoice.extraction.vendorName
                        InvoiceDate = $invoice.extraction.invoiceDate
                        TotalAmount = $invoice.extraction.totalAmount
                        Currency = $invoice.extraction.currency
                        TraceId = $response.TraceId
                    }
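                } else {
                    # Completed, but no invoice document was found in the output
                    $results += [PSCustomObject]@{
                        FileName = $pdf.Name
                        Status = "Error"
                        InvoiceNumber = ""
                        VendorName = ""
                        InvoiceDate = ""
                        TotalAmount = ""
                        Currency = ""
                        TraceId = $response.TraceId
                    }
                }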
            } else {
                $results += [PSCustomObject]@{
                    FileName = $pdf.Name
                    Status = "Error"
                    InvoiceNumber = ""
                    VendorName = ""
                    InvoiceDate = ""
                    TotalAmount = ""
                    Currency = ""
                    TraceId = $response.TraceId
                }
            }
        }
        catch {
            $results += [PSCustomObject]@{
                FileName = $pdf.Name
                Status = "Failed: $_"
                InvoiceNumber = ""
                VendorName = ""
                InvoiceDate = ""
                TotalAmount = ""
                Currency = ""
                TraceId = ""
            }
        }

        # Respect rate limits
        Start-Sleep -Milliseconds 500
    }

    # Export results
    $results | Export-Csv -Path $OutputCsv -NoTypeInformation
    Write-Host "`nProcessed $($results.Count) invoices. Results saved to $OutputCsv"

    return $results
}

# Usage
Process-InvoiceBatch -FolderPath "C:\Invoices" -ApiKey "YOUR_KEY" -ContextId "YOUR_CONTEXT"
    

Python Batch Processing

import csv
import uuid
import time
import requests
from pathlib import Path

def process_invoice_batch(folder_path, api_key, context_id, output_csv="results.csv"):
    """Process all PDF invoices in a folder."""

    pdf_files = list(Path(folder_path).glob("*.pdf"))
    results = []

    for pdf_file in pdf_files:
        print(f"Processing: {pdf_file.name}...")

        try:
            with open(pdf_file, "rb") as f:
                response = requests.post(
                    "https://apix.docdigitizer.com/sync",
                    headers={"x-api-key": api_key},
                    files={"files": f},
                    data={
                        "id": str(uuid.uuid4()),
                        "contextId": context_id
                    },
                    timeout=300
                )

            result = response.json()

            if result.get("StateText") == "COMPLETED":
                invoice = next(
                    (doc for doc in result.get("Output", [])
                     if doc.get("docType") == "Invoice"),
                    None
                )

                if invoice:
                    ext = invoice.get("extraction", {})
                    results.append({
                        "filename": pdf_file.name,
                        "status": "Success",
                        "invoice_number": ext.get("invoiceNumber", ""),
                        "vendor_name": ext.get("vendorName", ""),
                        "invoice_date": ext.get("invoiceDate", ""),
                        "total_amount": ext.get("totalAmount", ""),
                        "currency": ext.get("currency", ""),
                        "trace_id": result.get("TraceId", "")
                    })
                    continue

            results.append({
                "filename": pdf_file.name,
                "status": "Error",
                "invoice_number": "",
                "vendor_name": "",
                "invoice_date": "",
                "total_amount": "",
                "currency": "",
                "trace_id": result.get("TraceId", "")
            })

        except Exception as e:
            results.append({
                "filename": pdf_file.name,
                "status": f"Failed: {e}",
                "invoice_number": "",
                "vendor_name": "",
                "invoice_date": "",
                "total_amount": "",
                "currency": "",
                "trace_id": ""
            })

        # Respect rate limits
        time.sleep(0.5)

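    # Export to CSV (explicit field names, so an empty folder still yields a valid file)
    fieldnames = ["filename", "status", "invoice_number", "vendor_name",
                  "invoice_date", "total_amount", "currency", "trace_id"]
    with open(output_csv, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(results)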

    print(f"\nProcessed {len(results)} invoices. Results saved to {output_csv}")
    return results

# Usage
process_invoice_batch("/path/to/invoices", "YOUR_API_KEY", "YOUR_CONTEXT_ID")
    

Integration Examples

Export to Accounting Software

def export_to_accounting_format(extraction):
    """Convert extraction to common accounting import format."""

    return {
        "document_type": "INVOICE",
        "document_number": extraction.get("invoiceNumber"),
        "document_date": extraction.get("invoiceDate"),
        "due_date": extraction.get("dueDate"),
        "supplier": {
            "name": extraction.get("vendorName"),
            "tax_id": extraction.get("vendorTaxId"),
            "address": extraction.get("vendorAddress")
        },
        "amounts": {
            "net_amount": extraction.get("subtotal"),
            "tax_amount": extraction.get("taxAmount"),
            "gross_amount": extraction.get("totalAmount"),
            "currency": extraction.get("currency")
        },
        "lines": [
            {
                "description": item.get("description"),
                "quantity": item.get("quantity"),
                "unit_price": item.get("unitPrice"),
                "amount": item.get("amount"),
                "tax_rate": item.get("taxRate")
            }
            for item in extraction.get("lineItems", [])
        ],
        "payment": {
            "terms": extraction.get("paymentTerms"),
            "bank_account": extraction.get("bankAccount")
        }
    }
    

Database Integration

import sqlite3

def save_invoice_to_db(extraction, db_path="invoices.db"):
    """Save extracted invoice to SQLite database."""

    conn = sqlite3.connect(db_path)
    cursor = conn.cursor()

    # Create tables if not exist
    cursor.execute('''
        CREATE TABLE IF NOT EXISTS invoices (
            id INTEGER PRIMARY KEY AUTOINCREMENT,
            invoice_number TEXT,
            invoice_date TEXT,
            due_date TEXT,
            vendor_name TEXT,
            vendor_tax_id TEXT,
            customer_name TEXT,
            subtotal REAL,
            tax_amount REAL,
            total_amount REAL,
            currency TEXT,
            created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
        )
    ''')

    cursor.execute('''
        CREATE TABLE IF NOT EXISTS invoice_lines (
            id INTEGER PRIMARY KEY AUTOINCREMENT,
            invoice_id INTEGER,
            description TEXT,
            quantity REAL,
            unit_price REAL,
            amount REAL,
            FOREIGN KEY (invoice_id) REFERENCES invoices (id)
        )
    ''')

    # Insert invoice
    cursor.execute('''
        INSERT INTO invoices (
            invoice_number, invoice_date, due_date, vendor_name, vendor_tax_id,
            customer_name, subtotal, tax_amount, total_amount, currency
        ) VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, ?)
    ''', (
        extraction.get("invoiceNumber"),
        extraction.get("invoiceDate"),
        extraction.get("dueDate"),
        extraction.get("vendorName"),
        extraction.get("vendorTaxId"),
        extraction.get("customerName"),
        extraction.get("subtotal"),
        extraction.get("taxAmount"),
        extraction.get("totalAmount"),
        extraction.get("currency")
    ))

    invoice_id = cursor.lastrowid

    # Insert line items
    for item in extraction.get("lineItems", []):
        cursor.execute('''
            INSERT INTO invoice_lines (invoice_id, description, quantity, unit_price, amount)
            VALUES (?, ?, ?, ?, ?)
        ''', (
            invoice_id,
            item.get("description"),
            item.get("quantity"),
            item.get("unitPrice"),
            item.get("amount")
        ))

    conn.commit()
    conn.close()

    return invoice_id
    

Best Practices

Document Quality

  • Scan invoices at a minimum of 150 DPI (300 DPI recommended)
  • Ensure text is legible and not blurry
  • Avoid skewed or rotated documents
  • Native PDFs (not scanned) give the best results

Data Validation

  • Validate extracted totals against line item sums
  • Check that tax calculations are correct
  • Verify that vendor/customer tax IDs are in the correct format
  • Validate date formats before database insertion
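
The total and tax checks listed above could be sketched as follows. Two assumptions are built in and may not match every invoice: taxAmount is taxRate percent of subtotal (a single tax rate), and totalAmount is subtotal plus taxAmount with any discount already reflected in subtotal. The example response in this guide satisfies both. check_totals is a hypothetical helper name.

```python
def check_totals(extraction, tolerance=0.01):
    """Return a list of human-readable problems found in the financial fields."""
    problems = []
    subtotal = extraction.get("subtotal")
    tax_rate = extraction.get("taxRate")
    tax_amount = extraction.get("taxAmount")
    total = extraction.get("totalAmount")

    # Assumption: taxAmount is taxRate percent of subtotal (single tax rate).
    if None not in (subtotal, tax_rate, tax_amount):
        expected_tax = round(subtotal * tax_rate / 100, 2)
        if abs(expected_tax - tax_amount) > tolerance:
            problems.append(f"taxAmount {tax_amount} != expected {expected_tax}")

    # Assumption: totalAmount is subtotal plus taxAmount, with any discount
    # already reflected in subtotal.
    if None not in (subtotal, tax_amount, total):
        expected_total = round(subtotal + tax_amount, 2)
        if abs(expected_total - total) > tolerance:
            problems.append(f"totalAmount {total} != expected {expected_total}")

    return problems

# Values from the example response
sample = {"subtotal": 2500.00, "taxRate": 23, "taxAmount": 575.00, "totalAmount": 3075.00}
print(check_totals(sample))  # []
```

Checks are skipped when a required field is missing, so an empty list means "no contradiction found", not "fully verified".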

Error Handling

  • Always check StateText for “COMPLETED”
  • Handle missing fields gracefully (use defaults)
  • Log TraceId for troubleshooting
  • Implement retry logic for transient failures
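
Retry logic for transient failures might look like the sketch below, which uses exponential backoff. This guide does not say which failures are transient, so the wrapper retries on any exception; in practice you would probably narrow that to network errors such as requests.exceptions.ConnectionError and requests.exceptions.Timeout. call_with_retries and its parameters are hypothetical names.

```python
import time

def call_with_retries(operation, max_attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call operation() until it succeeds or max_attempts is reached.

    Waits base_delay, then 2x, 4x, ... between attempts and re-raises the
    last exception if every attempt fails.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except Exception:
            if attempt == max_attempts:
                raise
            sleep(base_delay * 2 ** (attempt - 1))

# Demonstration with a function that fails twice before succeeding
attempts = []

def flaky_operation():
    attempts.append(1)
    if len(attempts) < 3:
        raise ConnectionError("temporary failure")
    return "ok"

print(call_with_retries(flaky_operation, base_delay=0.01))
```

With the process_invoice function from earlier in this guide, a call could be wrapped as call_with_retries(lambda: process_invoice("invoice.pdf", api_key, context_id)). Note that process_invoice returns None for a response that is not COMPLETED rather than raising, so only exceptions trigger a retry here.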

Performance

  • Respect rate limits (use delays between requests)
  • Process invoices in parallel where possible (within limits)
  • Cache results for already-processed invoices
  • Use batch processing for large volumes
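
The parallel-processing and caching points can be combined in a sketch like the one below. It keys the cache on a SHA-256 digest of the file contents and caps concurrency with a thread pool. The guide does not state the actual rate limits, so max_workers=4 is an arbitrary placeholder; file_digest, process_in_parallel, and the in-memory dict cache are all hypothetical choices.

```python
import hashlib
from concurrent.futures import ThreadPoolExecutor

def file_digest(path):
    """SHA-256 of the file contents, used as a cache key."""
    with open(path, "rb") as f:
        return hashlib.sha256(f.read()).hexdigest()

def process_in_parallel(paths, process_one, cache, max_workers=4):
    """Process files concurrently, skipping those whose digest is cached.

    process_one is any callable taking a path and returning a result;
    cache is a dict mapping digest -> result and is updated in place.
    """
    digests = {path: file_digest(path) for path in paths}
    pending = [p for p in paths if digests[p] not in cache]

    # max_workers caps the number of simultaneous requests.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        for path, result in zip(pending, pool.map(process_one, pending)):
            cache[digests[path]] = result

    return {path: cache[digests[path]] for path in paths}
```

Here process_one could be a small wrapper around the process_invoice function from earlier in this guide. An exception raised inside process_one propagates out of pool.map, so error handling or retries belong inside that wrapper.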