BillyBox is an invoice management tool that connects to your email (Gmail, Outlook, Zoho, or any IMAP provider), automatically extracts invoice attachments, and lets you classify them as business, personal, or ignored. Export an organized ZIP for your accountant in minutes.

Yes. BillyBox uses read-only access — we never send, delete, or modify your emails. Credentials are encrypted with AES-256, data is hosted in EU data centers (GDPR compliant), and you can revoke access with one click.

How does the invoice classification work?

After BillyBox fetches invoices from your email, you classify each one as business, personal, or ignore using keyboard shortcuts (B/P/I) on desktop or swipe gestures on mobile. It takes about 3 minutes instead of 30.

What email providers are supported?

BillyBox works with Gmail (Google OAuth), Outlook/Hotmail/Live (Microsoft OAuth), Zoho Mail (IMAP), and any email provider that supports IMAP with an app password (Yahoo, ProtonMail Bridge, custom domains, and more). You can connect multiple accounts at once.
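Whichever provider delivers the message, the attachment-extraction step can be pictured with a small sketch. This is not BillyBox's actual code: the function name, the accepted MIME types, and the "unnamed" fallback are assumptions made for illustration. It uses only Python's standard email package.

```python
from email import message_from_bytes, policy

# Assumption: these MIME types are treated as invoice candidates.
INVOICE_TYPES = {"application/pdf", "application/xml", "text/xml"}


def invoice_attachments(raw_message: bytes) -> list:
    """Return (filename, payload) pairs for PDF/XML attachments in one email."""
    msg = message_from_bytes(raw_message, policy=policy.default)
    found = []
    for part in msg.iter_attachments():
        if part.get_content_type() in INVOICE_TYPES:
            # decode=True undoes base64 / quoted-printable transfer encoding
            payload = part.get_payload(decode=True)
            found.append((part.get_filename() or "unnamed", payload))
    return found
```

Inline logos and other image attachments are skipped simply because their content type is not in the set. A real pipeline would still need the AI filter described later, since a PDF attachment is not necessarily an invoice.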

Is there a free plan?

Yes! The free plan lets you fetch invoices from 2 calendar months with 2 email connections. No credit card required. Upgrade to Starter or Pro for unlimited months and more connections.

How is BillyBox different from accounting software?

BillyBox handles the step before accounting — collecting and organizing invoices from email. It does not replace QuickBooks, Xero, or Wave. Export your organized invoices and feed them into whatever accounting tool you use.

Can I send receipts from my phone?

Yes. Connect Telegram once in Settings, then forward photos, PDFs, or vendor portal links to the BillyBox bot. They land in the same review queue as your email invoices and follow the same classification flow. This complements your email connection — email still does the heavy lifting; Telegram is for the in-person receipts your inbox never sees.

What can I send to the BillyBox bot?

Photos (JPG/PNG) of paper receipts, PDF or XML invoices, and links to vendor portals (Stripe, AWS, hotel booking confirmations). The bot reads the document, returns the vendor and amount, and gives you Business / Personal / Ignore buttons right in the chat.

Can I add receipts or PDFs directly from the BillyBox app?

Yes — and you don't need to install anything. Open billybox.app in your phone's browser (it's a PWA, you can Add to Home Screen for an app-like icon) and tap the camera button in the review screen. It opens the rear camera so you can snap a paper receipt, or you can pick a PDF from your phone storage instead. The document goes through the same extraction and classification pipeline as your email invoices. Works alongside email and Telegram — pick whichever channel fits the document you have in hand.

How does PDF invoice extraction actually work?

BillyBox parses the PDF's text layer (the embedded text most modern PDFs contain) to pull vendor, amount, currency, date, invoice number and other fields. When the text layer is too sparse — scanned PDFs, image-only documents — it falls back to AI vision by rendering the first page as an image and running an OCR-style pass.
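The "too sparse" decision can be sketched as a simple character count. The threshold of 40 visible characters per page and the function name are assumptions for illustration, not BillyBox's real values.

```python
def needs_vision_fallback(page_texts: list, min_chars_per_page: int = 40) -> bool:
    """Return True when the extracted text layer looks too sparse to trust.

    page_texts holds the text-layer string of each page (empty for scans).
    """
    if not page_texts:
        return True
    # Count non-whitespace characters only, so blank padding does not count as text.
    visible = sum(len("".join(text.split())) for text in page_texts)
    return visible < min_chars_per_page * len(page_texts)
```

A scanned PDF typically yields empty strings for every page, so it falls through to the vision path, while a digitally generated invoice clears the threshold easily.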

What fields are extracted?

The v3.5 schema (May 2026) extracts 14 fields per document: vendor, amount, currency, date, invoice number, due date, tax rate %, vendor country, vendor VAT number, service category, spend category, billing period start/end, product name, plus subscription signals (is_recurring_self_claim, next_billing_date, is_trial, cancellation_event).
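One way to picture that schema is as a pair of dataclasses. Only the four subscription signal names come from the list above. The other attribute names, all the types, and the split into two classes are assumptions for illustration.

```python
from dataclasses import dataclass, field
from datetime import date
from decimal import Decimal
from typing import Optional


@dataclass
class SubscriptionSignals:
    is_recurring_self_claim: Optional[bool] = None
    next_billing_date: Optional[date] = None
    is_trial: Optional[bool] = None
    cancellation_event: Optional[bool] = None  # type is a guess


@dataclass
class ExtractedInvoice:
    # The 14 document fields, counting billing period start and end separately.
    vendor: Optional[str] = None
    amount: Optional[Decimal] = None
    currency: Optional[str] = None
    date: Optional[date] = None
    invoice_number: Optional[str] = None
    due_date: Optional[date] = None
    tax_rate_percent: Optional[Decimal] = None
    vendor_country: Optional[str] = None
    vendor_vat_number: Optional[str] = None
    service_category: Optional[str] = None
    spend_category: Optional[str] = None
    billing_period_start: Optional[date] = None
    billing_period_end: Optional[date] = None
    product_name: Optional[str] = None
    subscription: SubscriptionSignals = field(default_factory=SubscriptionSignals)
```

Every field is optional in this sketch because extraction can miss any of them.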

Does it handle multi-page invoices?

Yes. BillyBox uses pdfjs-dist 5.6 (legacy build for Safari compatibility) with per-instance PagesMapper, so multi-page documents render correctly and AI extraction reads relevant pages. A regression in 5.4 that broke page 2 was fixed in the 5.6 upgrade.

What if extraction misses an amount?

You can edit any extracted field inline in the review queue. The change overrides the AI value and goes into the export as-is. The AI suggestion is preserved alongside in ai_suggested_classification for telemetry, but your edited value is what your accountant sees.

PDF invoice extraction: how to extract data from invoice PDFs automatically

March 2026

Extracting data from PDF invoices is one of the most tedious tasks in accounting. Every vendor generates PDFs differently — different layouts, fonts, languages, and structures. What looks simple to the human eye (read the total, find the vendor name) is surprisingly hard to automate. This article explains how PDF invoice extraction works, why it's difficult, and how modern tools solve it.

Why PDF invoice extraction is hard

PDFs were designed for visual display, not data extraction. Unlike a spreadsheet or JSON file, a PDF doesn't label its fields. The "total" is just text positioned at certain coordinates on the page. There is no semantic tag that says "this is the invoice amount."

No consistent layout — every vendor puts amounts, dates, and totals in different positions
Scanned vs. digital — scanned invoices are images, not text. You need OCR before any extraction
Multilingual invoices — "Total", "Gesamt", "Toplam", "Total de plată" all mean the same thing
Currency ambiguity — "1.234,56" means 1234.56 EUR in Germany, but a parser expecting US formatting may read it as 1.234 or reject it outright
Multiple amounts — subtotal, tax, shipping, discount, total, amount due — which one matters?
Embedded fonts and encodings — some PDFs use custom fonts that break standard text extraction

How PDF text extraction works

The first step in any PDF invoice parser is getting the raw text. Digital PDFs (the kind generated by billing software) have an embedded text layer. Tools like pdfplumber or PyMuPDF can extract this text along with its position on the page.

For scanned invoices (photos or scanned paper documents), traditional OCR tools like Tesseract or cloud APIs from Google/AWS convert the image to text first. But a newer approach has emerged: AI vision models. Instead of running a separate OCR step followed by text parsing, a vision model can look at a scanned PDF or photo directly and extract structured data (vendor name, amount, currency, date) in a single pass. This is often more accurate than separate OCR-then-parse pipelines because the model understands document layout, not just character shapes. The same capability handles photographed paper receipts from local vendors.

The good news: most invoices from SaaS vendors, cloud providers, and online services are digitally generated. They have clean text layers that can be extracted instantly and accurately — no OCR needed. And for the rest (scanned receipts, photographed invoices), AI vision handles them with high accuracy.

Finding the right numbers: amount and currency detection

Once you have the raw text, the real challenge begins: figuring out which number is the invoice total. A typical invoice PDF contains dozens of numbers — dates, quantities, unit prices, tax percentages, subtotals, and the final amount.

Smart extraction uses a combination of techniques:

Keyword proximity — look for numbers near "Total", "Amount Due", "Grand Total" in multiple languages
Regex patterns — match currency-specific formats like €1,234.56, 1.234,56 EUR, $500.00, or 2.500,00 RON
Position heuristics — totals tend to appear in the bottom third of the page, right-aligned
Currency symbol detection — identify €, $, £, RON, lei, and other currency markers near amounts
Format-aware parsing — correctly interpret European (1.234,56) vs. US (1,234.56) number formats
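Two of the techniques above, format-aware parsing and keyword proximity, can be sketched in a few lines. This is a simplified illustration, not a production parser: the function names, the keyword list, and the rule for resolving a lone separator are assumptions.

```python
import re
from decimal import Decimal
from typing import Optional

TOTAL_KEYWORDS = ("total", "amount due", "gesamt", "toplam")
AMOUNT = re.compile(r"\d[\d.,]*\d|\d")


def parse_amount(token: str) -> Decimal:
    """Interpret European (1.234,56) and US (1,234.56) number formats."""
    s = token.strip().replace(" ", "").replace("\u00a0", "")
    last_dot, last_comma = s.rfind("."), s.rfind(",")
    if last_dot != -1 and last_comma != -1:
        # Both separators present: the right-most one is the decimal mark.
        if last_comma > last_dot:
            s = s.replace(".", "").replace(",", ".")
        else:
            s = s.replace(",", "")
    elif last_comma != -1:
        # Assumption: a single comma followed by two digits is a decimal mark.
        if s.count(",") == 1 and len(s) - last_comma - 1 == 2:
            s = s.replace(",", ".")
        else:
            s = s.replace(",", "")
    elif last_dot != -1:
        # Assumption: a single dot followed by three digits is a thousands mark.
        if s.count(".") > 1 or len(s) - last_dot - 1 == 3:
            s = s.replace(".", "")
    return Decimal(s)


def pick_total(text: str) -> Optional[Decimal]:
    """Return the last amount found on a line that mentions a total keyword."""
    best = None
    for line in text.splitlines():
        lowered = line.lower()
        if "subtotal" in lowered:
            continue  # "subtotal" contains "total" but is not the final amount
        if any(keyword in lowered for keyword in TOTAL_KEYWORDS):
            amounts = AMOUNT.findall(line)
            if amounts:
                best = parse_amount(amounts[-1])
    return best
```

A bare "1.234" is genuinely ambiguous, so the three-digit rule is only a guess. Knowing the vendor or the currency is what resolves such cases in practice.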

Vendor identification

Knowing which vendor issued the invoice makes extraction dramatically easier. If you know it's a Hetzner invoice, you know exactly where to look for the total and what format to expect. This is why vendor registries are so valuable.

Vendor identification works by matching the sender's email domain (billing@stripe.com), the PDF's metadata, or recognizable text patterns in the document itself (company names, VAT numbers, known layouts). Once the vendor is identified, a vendor-specific extraction profile can be applied for higher accuracy.
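The sender-domain part of that matching can be sketched as a registry lookup. The registry contents and the function name are hypothetical. A real registry would be much larger and would be combined with the PDF metadata and text-pattern signals.

```python
from email.utils import parseaddr
from typing import Optional

# Hypothetical registry: sender domain -> vendor name.
KNOWN_SENDERS = {
    "stripe.com": "Stripe",
    "hetzner.com": "Hetzner",
    "amazonaws.com": "AWS",
}


def vendor_from_sender(from_header: str) -> Optional[str]:
    """Match the sender's email domain, or any parent domain, to a vendor."""
    _, address = parseaddr(from_header)
    domain = address.rpartition("@")[2].lower()
    # Walk up subdomains: mail.hetzner.com -> hetzner.com -> com
    while domain:
        if domain in KNOWN_SENDERS:
            return KNOWN_SENDERS[domain]
        _, _, domain = domain.partition(".")
    return None
```

Walking up the subdomains lets one registry entry cover addresses sent from billing or notification subdomains of the same vendor.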

Approaches to invoice data extraction compared

Manual tools: Tabula, Camelot, pdfplumber

Open-source libraries that extract tables from PDFs. They work well for structured, table-heavy invoices but require manual configuration per vendor. You write the extraction rules yourself. Great for one-off scripts, impractical once you're juggling invoices from dozens of vendors with different layouts.

Best for: developers building custom pipelines for a single vendor format.

Enterprise OCR: ABBYY, Kofax, Rossum

Enterprise-grade solutions with AI-powered extraction. They handle scanned documents, handwriting, and complex layouts. But they're priced for enterprise AP departments, typically through custom quotes, and they require setup and training.

Best for: large enterprises with high-volume accounts payable departments.

Cloud AI APIs: Google Document AI, AWS Textract

Pay-per-page APIs that extract structured data from documents. More accessible than enterprise solutions but still require integration work — you need to build the pipeline that sends PDFs to the API, handles responses, maps fields, and stores results. Per-page pricing adds up with volume.

Best for: developers comfortable with API integration who need a generic solution.

BillyBox: Automated extraction from email

BillyBox takes a different approach: instead of giving you an extraction tool, it handles the entire pipeline. Connect your Gmail, Outlook, Zoho, or any IMAP email, and BillyBox fetches invoices, uses AI to filter out non-invoices, extracts data from both digital and scanned PDFs (via AI vision OCR), downloads invoices from billing portals linked in emails, and generates PDF receipts from email content when no attachment exists. You can also manually upload PDFs, XML files, or photos. No configuration per vendor. No API integration.

Best for: freelancers and small businesses who want results, not a tool to configure.

How BillyBox extracts invoice data

BillyBox combines multiple extraction strategies to handle the diversity of real-world invoices:

AI classification gate — before extraction even begins, an AI model analyzes each attachment to determine if it's a real invoice. Logos, marketing PDFs, shipping labels, and other noise are automatically filtered out. Only real invoices proceed to data extraction.
AI-powered OCR for scanned documents — when a PDF contains scanned images instead of text (or when you upload a photo of a receipt), BillyBox uses AI vision to extract data directly from the image. No separate OCR step — the AI model reads the document visually and returns structured fields with confidence scores.
AI-driven data extraction — the actual fields (vendor, amount, currency, tax, dates, invoice number, and more) are pulled out by an AI model (configurable backend: OpenAI or Anthropic Claude) running a structured prompt over the document text. No hand-written regex per vendor — the same prompt handles AWS, Stripe, your local utility company, and a random one-off invoice equally well.
50+ known invoice senders for pre-filtering — a curated list of common invoice domains (Stripe, AWS, DigitalOcean, Hetzner, Google Cloud, Anthropic, etc.) lets BillyBox skip the cheap rule-based checks and resolve portal links to the actual PDF before AI extraction runs.
Currency & number-format awareness — the AI prompt is tuned for European, US, and mixed number formats, and a deterministic post-processor validates candidate amounts by proximity to keywords like "Total" and "Amount Due" in 10+ languages.
Multi-currency support — EUR, USD, RON, GBP, and other currencies are detected from symbols, ISO codes, and contextual text. European comma-decimal formats are handled correctly.
XML e-invoice parsing — many European invoices use structured XML formats (UBL, CII). BillyBox parses these directly for exact, structured values — no guessing.
Email context — the sender domain, subject line, and email body provide additional signals for vendor identification and amount verification.
Portal invoice download — when an email contains a link to download an invoice from a billing portal (Stripe dashboard, AWS billing, FreshBooks), BillyBox detects the URL and fetches the actual PDF. If automatic download isn't possible, the link is shown so you can grab it manually.
Receipt generation from email content — when there's no PDF attachment and no downloadable link, BillyBox generates a clean PDF receipt from the email body, extracting vendor, amount, date, and formatting it as a proper document. Payment confirmations from services like PayPal or utility providers become real PDF receipts automatically.
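Of the strategies above, XML e-invoice parsing is the easiest to show concretely, because the values are already labeled. The sketch below reads a few fields from a UBL 2.x invoice with Python's standard library. The function name and the returned keys are assumptions, and CII documents use different element names that are not covered here.

```python
import xml.etree.ElementTree as ET
from decimal import Decimal

# Standard UBL 2.x namespaces for basic and aggregate components.
NS = {
    "cbc": "urn:oasis:names:specification:ubl:schema:xsd:CommonBasicComponents-2",
    "cac": "urn:oasis:names:specification:ubl:schema:xsd:CommonAggregateComponents-2",
}


def parse_ubl_invoice(xml_text: str) -> dict:
    """Read invoice number, date, vendor, amount, and currency from UBL XML."""
    root = ET.fromstring(xml_text)
    payable = root.find("cac:LegalMonetaryTotal/cbc:PayableAmount", NS)
    return {
        "invoice_number": root.findtext("cbc:ID", namespaces=NS),
        "date": root.findtext("cbc:IssueDate", namespaces=NS),
        "vendor": root.findtext(
            "cac:AccountingSupplierParty/cac:Party/cac:PartyName/cbc:Name",
            namespaces=NS,
        ),
        # The payable amount carries its currency as an attribute.
        "amount": Decimal(payable.text.strip()) if payable is not None else None,
        "currency": payable.get("currencyID") if payable is not None else None,
    }
```

No keyword search or number-format guessing is needed here, which is why structured XML gives exact values where PDF extraction gives estimates.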

What gets extracted

For each invoice, BillyBox extracts and displays:

Issuer / vendor name — e.g., Hetzner, AWS, Notion
Invoice amount — e.g., €49.90, $127.00
Currency — EUR, USD, RON, GBP
Invoice number — e.g., INV-2026-0384
Invoice date — from PDF or email date
Issued to — recipient name or company
Description — line item summary
Tax amount — VAT / sales tax extracted separately
Subtotal — pre-tax amount when available
Sender email — original email address

All extracted data is shown alongside a PDF preview so you can verify at a glance. If extraction misses or misreads something, you can edit any field inline — issuer, amount, currency, and date are all editable directly in the review interface.

Digital vs. scanned invoices

BillyBox handles both. For digitally generated invoices — the kind sent by SaaS vendors, cloud providers, and online services via email — it extracts the embedded text layer instantly and accurately. No OCR needed, no AI cost.

For scanned invoices and photos (receipts photographed with a phone, scanned utility bills, or images uploaded via drag-and-drop), BillyBox uses AI vision to read the document directly. The AI model analyzes the image, identifies the layout, and extracts structured data — issuer, amount, currency, date, invoice number — in a single pass. Each extracted field includes an AI confidence score so you can see at a glance which values may need verification.
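How per-field confidence scores might drive the review step can be sketched as a threshold check. The 0 to 1 scale, the 0.8 cutoff, and the function name are assumptions for illustration. The article does not state how BillyBox represents or thresholds its scores.

```python
def fields_to_review(confidences: dict, threshold: float = 0.8) -> list:
    """Return the names of fields whose confidence falls below the threshold."""
    return sorted(name for name, score in confidences.items() if score < threshold)
```

For example, scores of 0.98 for the vendor, 0.62 for the amount, and 0.79 for the date would flag the amount and the date for a manual check.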

You can also manually upload documents (PDF, XML, JPG, PNG) via drag-and-drop if you have invoices that didn't arrive by email — paper receipts, downloaded files, or invoices from portals that don't send email notifications.

Try it free

BillyBox's free plan lets you process 2 months of invoices with 2 email connections. Connect your Gmail, Outlook, or any IMAP email, fetch a month, and see the extraction results in minutes — including AI-powered OCR for scanned documents. No credit card, no setup scripts, no vendor-specific configuration needed.
