Written by Andrei Biro. Last updated: March 2026

PDF invoice extraction: how to extract data from invoice PDFs automatically

Extracting data from PDF invoices is one of the most tedious tasks in accounting. Every vendor generates PDFs differently — different layouts, fonts, languages, and structures. What looks simple to the human eye (read the total, find the vendor name) is surprisingly hard to automate. This article explains how PDF invoice extraction works, why it's difficult, and how modern tools solve it.

Why PDF invoice extraction is hard

PDFs were designed for visual display, not data extraction. Unlike a spreadsheet or JSON file, a PDF doesn't label its fields. The "total" is just text positioned at certain coordinates on the page. There is no semantic tag that says "this is the invoice amount."

  • No consistent layout — every vendor puts amounts, dates, and totals in different positions
  • Scanned vs. digital — scanned invoices are images, not text. You need OCR before any extraction
  • Multilingual invoices — "Total", "Gesamt", "Toplam", "Total de plată" all mean the same thing
  • Number format ambiguity — "1.234,56" means 1234.56 in Germany, while the same amount is written "1,234.56" in the US
  • Multiple amounts — subtotal, tax, shipping, discount, total, amount due — which one matters?
  • Embedded fonts and encodings — some PDFs use custom fonts that break standard text extraction

How PDF text extraction works

The first step in any PDF invoice parser is getting the raw text. Digital PDFs (the kind generated by billing software) have an embedded text layer. Tools like pdfplumber or PyMuPDF can extract this text along with its position on the page.
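
As a rough sketch of this step, the snippet below reads positioned words with pdfplumber and groups them into text lines. The helper names (extract_words, group_into_lines) and the 3-point tolerance are illustrative assumptions, not part of any tool described in this article.

```python
def extract_words(pdf_path):
    """Return (page_number, word) pairs from a PDF's embedded text layer."""
    import pdfplumber  # third-party: pip install pdfplumber

    results = []
    with pdfplumber.open(pdf_path) as pdf:
        for page_number, page in enumerate(pdf.pages, start=1):
            # Each word is a dict with keys such as "text", "x0", "x1", "top", "bottom".
            for word in page.extract_words():
                results.append((page_number, word))
    return results


def group_into_lines(words, tolerance=3.0):
    """Group word dicts from one page into lines of text by vertical position.

    A word starts a new line when its "top" differs from the first word of the
    current line by more than `tolerance` points (an assumed value).
    """
    lines = []
    current, current_top = [], None
    for word in sorted(words, key=lambda w: (w["top"], w["x0"])):
        if current_top is not None and abs(word["top"] - current_top) > tolerance:
            lines.append(current)
            current = []
        if not current:
            current_top = word["top"]
        current.append(word)
    if current:
        lines.append(current)
    # Within a line, order words left to right before joining them.
    return [
        " ".join(w["text"] for w in sorted(line, key=lambda w: w["x0"]))
        for line in lines
    ]
```

Keeping the coordinates around, instead of flattening the page to a single string, is what later makes position-based heuristics possible.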

For scanned invoices (photos or scanned paper documents), traditional OCR tools like Tesseract or cloud APIs from Google and AWS convert the image to text first. A newer approach uses AI vision models. Instead of running a separate OCR step followed by text parsing, a vision model can look at a scanned PDF or photo directly and extract structured data (vendor name, amount, currency, date) in a single pass. This can be significantly more accurate than a traditional OCR pipeline because the model interprets the document layout, not just character shapes. The same capability also handles handwritten receipts, such as photographed notes from local vendors or handwritten totals on paper receipts, extracting the key fields with reasonable accuracy for most common handwriting styles.

The good news: most invoices from SaaS vendors, cloud providers, and online services are digitally generated. They have clean text layers that can be extracted instantly and accurately — no OCR needed. And for the rest (scanned receipts, photographed invoices), AI vision handles them with high accuracy.
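
One simple way to decide between the two paths is to check how much text each page's text layer actually contains. The sketch below assumes the per-page strings come from something like pdfplumber's page.extract_text(); the function name and the 20-character threshold are arbitrary illustrations.

```python
def route_pages(page_texts, min_chars=20):
    """Label each page "text" (usable text layer) or "ocr" (needs OCR / AI vision).

    `page_texts` holds one string (or None) per page. `min_chars` is an
    assumed cutoff below which a page is treated as a scanned image.
    """
    routes = []
    for text in page_texts:
        usable = len((text or "").strip()) >= min_chars
        routes.append("text" if usable else "ocr")
    return routes
```

Routing per page rather than per file helps with mixed documents, for example a digital invoice with a scanned attachment appended.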

Finding the right numbers: amount and currency detection

Once you have the raw text, the real challenge begins: figuring out which number is the invoice total. A typical invoice PDF contains dozens of numbers — dates, quantities, unit prices, tax percentages, subtotals, and the final amount.

Smart extraction uses a combination of techniques:

  • Keyword proximity — look for numbers near "Total", "Amount Due", "Grand Total" in multiple languages
  • Regex patterns — match currency-specific formats like €1,234.56, 1.234,56 EUR, $500.00, or 2.500,00 RON
  • Position heuristics — totals tend to appear in the bottom third of the page, right-aligned
  • Currency symbol detection — identify €, $, £, RON, lei, and other currency markers near amounts
  • Format-aware parsing — correctly interpret European (1.234,56) vs. US (1,234.56) number formats
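
A minimal sketch of two of these techniques, format-aware parsing and keyword proximity, is shown below. The keyword list, the function names, and the rule that a lone separator followed by three digits (as in "1.234") is a thousands separator are all assumptions made for illustration; a production parser would need more languages and more edge cases.

```python
import re
from decimal import Decimal

# US style: 1,234.56 or 500.00   European style: 1.234,56 or 2500,00
_US = re.compile(r"^\d{1,3}(?:,\d{3})*(?:\.\d{2})?$|^\d+(?:\.\d{2})?$")
_EU = re.compile(r"^\d{1,3}(?:\.\d{3})*(?:,\d{2})?$|^\d+(?:,\d{2})?$")

# Any run of digits with "." or "," inside it, starting and ending on a digit.
TOKEN_RE = re.compile(r"\d[\d.,]*\d|\d")

# Longer phrases come first so they win over the bare word "total".
KEYWORD_RE = re.compile(
    r"\b(?:amount due|grand total|total de plată|gesamt|toplam|total)\b",
    re.IGNORECASE,
)


def parse_amount(raw):
    """Parse a US- or European-formatted number into a Decimal, else None."""
    text = raw.strip()
    if _US.match(text):
        return Decimal(text.replace(",", ""))
    if _EU.match(text):
        return Decimal(text.replace(".", "").replace(",", "."))
    return None  # e.g. dates such as 12.03.2026


def find_total(text):
    """Return the amount that follows a total-like keyword most closely."""
    best = None  # (distance in characters, amount)
    for line in text.splitlines():
        keyword_ends = [m.end() for m in KEYWORD_RE.finditer(line)]
        if not keyword_ends:
            continue
        for token in TOKEN_RE.finditer(line):
            amount = parse_amount(token.group())
            if amount is None:
                continue
            # Only keywords that appear before the number on the same line count.
            gaps = [token.start() - end for end in keyword_ends if end <= token.start()]
            if gaps and (best is None or min(gaps) < best[0]):
                best = (min(gaps), amount)
    return best[1] if best else None
```

The word boundaries in the keyword pattern keep "Subtotal" from being mistaken for "Total", which is one of the most common ways a naive search picks the wrong number.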

Vendor identification

Knowing which vendor issued the invoice makes extraction dramatically easier. If you know it's a Hetzner invoice, you know exactly where to look for the total and what format to expect. This is why vendor registries are so valuable.

Vendor identification works by matching the sender's email domain (billing@stripe.com), the PDF's metadata, or recognizable text patterns in the document itself (company names, VAT numbers, known layouts). Once the vendor is identified, a vendor-specific extraction profile can be applied for higher accuracy.
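
The matching logic might look roughly like the sketch below. The registry contents, its structure, and the function name are hypothetical; real sender domains and text patterns would have to be collected per vendor.

```python
import re

# Hypothetical registry: sender domains and text patterns per vendor.
VENDOR_REGISTRY = {
    "Stripe": {"domains": {"stripe.com"}, "patterns": [r"\bStripe\b"]},
    "Hetzner": {"domains": {"hetzner.com"}, "patterns": [r"Hetzner Online GmbH"]},
}


def identify_vendor(sender_email, document_text):
    """Return a vendor name from the sender domain or document text, else None."""
    domain = sender_email.rsplit("@", 1)[-1].lower()

    # 1. The sender domain is the cheapest signal: exact match or subdomain.
    for vendor, profile in VENDOR_REGISTRY.items():
        if any(domain == d or domain.endswith("." + d) for d in profile["domains"]):
            return vendor

    # 2. Fall back to recognizable text inside the document itself.
    for vendor, profile in VENDOR_REGISTRY.items():
        if any(re.search(p, document_text, re.IGNORECASE) for p in profile["patterns"]):
            return vendor

    return None
```

Checking the domain before the document text is a design choice in this sketch: forwarded invoices break the domain signal, so the text patterns act as the fallback.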

Approaches to invoice data extraction compared

Manual tools: Tabula, Camelot, pdfplumber

Open-source libraries that extract tables from PDFs. They work well for structured, table-heavy invoices but require manual configuration per vendor. You write the extraction rules yourself. Great for one-off scripts, impractical for ongoing use with 50+ vendors.

Best for: developers building custom pipelines for a single vendor format.

Enterprise OCR: ABBYY, Kofax, Rossum

Enterprise-grade solutions with AI-powered extraction. They handle scanned documents, handwriting, and complex layouts. But they're priced for companies processing thousands of invoices per month — typically $500-2000+/month, with setup and training required.

Best for: large enterprises with high-volume accounts payable departments.

Cloud AI APIs: Google Document AI, AWS Textract

Pay-per-page APIs that extract structured data from documents. More accessible than enterprise solutions but still require integration work — you need to build the pipeline that sends PDFs to the API, handles responses, maps fields, and stores results. Per-page pricing adds up with volume.

Best for: developers comfortable with API integration who need a generic solution.

BillyBox: Automated extraction from email

BillyBox takes a different approach: instead of giving you an extraction tool, it handles the entire pipeline. Connect your Gmail, Outlook, Zoho, or any IMAP email, and BillyBox fetches invoices, uses AI to filter out non-invoices, extracts data from both digital and scanned PDFs (via AI vision OCR), downloads invoices from billing portals linked in emails, and generates PDF receipts from email content when no attachment exists. You can also manually upload PDFs, XML files, or photos. No configuration per vendor. No API integration.

Best for: freelancers and small businesses who want results, not a tool to configure.

How BillyBox extracts invoice data

BillyBox combines multiple extraction strategies to handle the diversity of real-world invoices:

  • AI classification gate — before extraction even begins, an AI model analyzes each attachment to determine if it's a real invoice. Logos, marketing PDFs, shipping labels, and other noise are automatically filtered out. Only real invoices proceed to data extraction.
  • AI-powered OCR for scanned documents — when a PDF contains scanned images instead of text (or when you upload a photo of a receipt), BillyBox uses AI vision to extract data directly from the image. No separate OCR step — the AI model reads the document visually and returns structured fields with confidence scores.
  • 50+ vendor-specific patterns — for common vendors like AWS, Stripe, DigitalOcean, Hetzner, Google Cloud, and Anthropic, BillyBox knows exactly where to find the total and in what format to expect it.
  • Smart amount detection — for unknown vendors, BillyBox scans the full text layer using regex patterns that handle European, US, and mixed number formats. It ranks candidate amounts by proximity to keywords like "Total" and "Amount Due" in 10+ languages.
  • Multi-currency support — EUR, USD, RON, GBP, and other currencies are detected from symbols, ISO codes, and contextual text. European comma-decimal formats are handled correctly.
  • XML e-invoice parsing — many European invoices use structured XML formats (UBL, CII). BillyBox parses these directly for perfect accuracy — no guessing needed.
  • Email context — the sender domain, subject line, and email body provide additional signals for vendor identification and amount verification.
  • Portal invoice download — when an email contains a link to download an invoice from a billing portal (Stripe dashboard, AWS billing, FreshBooks), BillyBox detects the URL and fetches the actual PDF. If automatic download isn't possible, the link is shown so you can grab it manually.
  • Receipt generation from email content — when there's no PDF attachment and no downloadable link, BillyBox generates a clean PDF receipt from the email body, extracting vendor, amount, date, and formatting it as a proper document. Payment confirmations from services like PayPal or utility providers become real PDF receipts automatically.
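
To illustrate why structured XML needs no guessing, here is a small sketch that reads the payable amount from a UBL 2.x invoice using only the Python standard library. It is not BillyBox's implementation; the function name and the returned dictionary shape are assumptions, while the namespaces and element names come from the UBL schema.

```python
import xml.etree.ElementTree as ET
from decimal import Decimal

# Standard UBL 2.x namespaces for aggregate (cac) and basic (cbc) components.
UBL_NS = {
    "cac": "urn:oasis:names:specification:ubl:schema:xsd:CommonAggregateComponents-2",
    "cbc": "urn:oasis:names:specification:ubl:schema:xsd:CommonBasicComponents-2",
}


def parse_ubl_invoice(xml_text):
    """Return invoice number, payable amount, and currency from a UBL invoice.

    Returns None when the document has no LegalMonetaryTotal/PayableAmount.
    """
    root = ET.fromstring(xml_text)
    payable = root.find("cac:LegalMonetaryTotal/cbc:PayableAmount", UBL_NS)
    if payable is None or payable.text is None:
        return None
    return {
        "invoice_number": root.findtext("cbc:ID", namespaces=UBL_NS),
        "amount": Decimal(payable.text.strip()),
        # UBL carries the ISO currency code as an attribute on the amount.
        "currency": payable.get("currencyID"),
    }
```

Because the amount and its currency are explicitly tagged, none of the keyword or position heuristics used for PDFs are needed here.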

What gets extracted

For each invoice, BillyBox extracts and displays:

  • Issuer / vendor name — e.g., Hetzner, AWS, Notion
  • Invoice amount — e.g., €49.90, $127.00
  • Currency — EUR, USD, RON, GBP
  • Invoice number — e.g., INV-2026-0384
  • Invoice date — from PDF or email date
  • Issued to — recipient name or company
  • Description — line item summary
  • Tax amount — VAT / sales tax extracted separately
  • Subtotal — pre-tax amount when available
  • Sender email — original email address

All extracted data is shown alongside a PDF preview so you can verify at a glance. If extraction misses or misreads something, you can edit any field inline — issuer, amount, currency, and date are all editable directly in the review interface.
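
For readers building their own pipeline, the fields listed above map naturally onto a small record type. The class and attribute names below are hypothetical and do not describe BillyBox's internal data model; the consistency check is one assumed example of an automatic sanity test that can run before manual review.

```python
from dataclasses import dataclass
from datetime import date
from decimal import Decimal
from typing import Optional


@dataclass
class ExtractedInvoice:
    """One extracted invoice; only issuer, amount, and currency are required."""

    issuer: str
    amount: Decimal
    currency: str
    invoice_number: Optional[str] = None
    invoice_date: Optional[date] = None
    issued_to: Optional[str] = None
    description: Optional[str] = None
    tax_amount: Optional[Decimal] = None
    subtotal: Optional[Decimal] = None
    sender_email: Optional[str] = None

    def totals_consistent(self):
        """True if subtotal + tax equals the amount, or if either part is missing.

        Invoices with discounts or shipping lines can fail this check without
        being wrong, so a False result is a hint to review, not an error.
        """
        if self.subtotal is None or self.tax_amount is None:
            return True
        return self.subtotal + self.tax_amount == self.amount
```

Using Decimal rather than float keeps the comparison exact, which matters when amounts are compared to the cent.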

Digital vs. scanned invoices

BillyBox handles both. For digitally generated invoices — the kind sent by SaaS vendors, cloud providers, and online services via email — it extracts the embedded text layer instantly and accurately. No OCR needed, no AI cost.

For scanned invoices and photos (receipts photographed with a phone, scanned utility bills, or images uploaded via drag-and-drop), BillyBox uses AI vision to read the document directly. The AI model analyzes the image, identifies the layout, and extracts structured data — issuer, amount, currency, date, invoice number — in a single pass. Each extracted field includes an AI confidence score so you can see at a glance which values may need verification.
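
Per-field confidence scores are easy to act on in code. The sketch below assumes scores between 0 and 1 and an arbitrary review threshold of 0.85; neither the scale nor the threshold is stated by BillyBox, and the function name is illustrative.

```python
REVIEW_THRESHOLD = 0.85  # assumed cutoff, tune against your own documents


def fields_needing_review(confidences, threshold=REVIEW_THRESHOLD):
    """Return the names of fields whose confidence is below the threshold.

    `confidences` maps field names to scores in the range 0..1 (assumed scale).
    """
    return sorted(name for name, score in confidences.items() if score < threshold)
```

For example, a scan where the issuer is clear but the amount is smudged would surface only the amount for manual checking.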

You can also manually upload documents (PDF, XML, JPG, PNG) via drag-and-drop if you have invoices that didn't arrive by email — paper receipts, downloaded files, or invoices from portals that don't send email notifications.


Try it free

BillyBox's free plan lets you process 2 months of invoices with 2 email connections. Connect your Gmail, Outlook, or any IMAP email, fetch a month, and see the extraction results in minutes — including AI-powered OCR for scanned documents. No credit card, no setup scripts, no vendor-specific configuration needed.
