
Documentation & User Guide

Secure, 100% local PDF extraction & smart file identification for EA & CPA professionals.

Security & Privacy

100% Local Processing

All PDF parsing, OCR, and file identification happen exclusively inside your browser tab. The PDF Master for EA & CPA server delivers only static HTML/JS files; it has no mechanism to receive, read, or store your data.

  • Zero-Knowledge Server: The remote server provides only static files. It never sees your documents.
  • Browser Isolation: Once the page is loaded, all logic runs locally. You can disconnect from the internet and everything still works; the only exception is OCR, which needs its language data downloaded once before it can run offline.
  • No Cloud APIs: OCR, file detection, and extraction never call any external API or service.

How to verify?

  1. Open this tool in your browser.
  2. Disconnect your computer from the internet (for example, turn Wi-Fi off).
  3. Upload a PDF and run an extraction.
  4. The extraction still completes offline, which shows that the tool does not need to send your data anywhere.

PDF Extraction (Core Feature)

Extract structured data from PDF documents — including tax returns, brokerage statements, financial reports and IRS forms — directly into Excel. Designed for CPA, EA and financial professionals who need accurate, private extraction with zero cloud dependency. All processing stays in your browser.

1. Quick Extraction (Default)

Reads the embedded text layer of digital PDFs (e.g. exported from Excel, QuickBooks, tax software, or Word). Numeric values are copied exactly as they are encoded in the file, because the tool reads the stored text strings rather than recognizing pixel patterns, so there are no OCR misreads. Ideal for brokerage statements, K-1s, balance sheets, and any machine-generated PDF.

2. Column Range Marking: The #1 Recommended Method

Drag horizontally across the PDF preview panel to select column ranges. The tool groups text into columns based on its physical X-position on the page.

As you define each range, the left-side Text panel updates in real time — showing a preview of column-separated output that closely matches the final Excel export. What you see is what you get.
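
The grouping idea can be sketched as follows. This is an illustrative Python sketch only, not the tool's actual browser code: the TextItem type, the assign_columns function, and the y_tolerance value are hypothetical names and settings chosen for the example. It assumes each piece of text comes with the X/Y position of its left edge, and that a text run belongs to the column whose range contains that X position.

```python
from dataclasses import dataclass


@dataclass
class TextItem:
    text: str
    x: float  # left edge of the text run, in page units
    y: float  # vertical position; larger values are higher on the page


def assign_columns(items, ranges, y_tolerance=2.0):
    """Group items into rows by Y, then into the marked column ranges by X."""
    rows = []
    # Walk the page top to bottom, left to right.
    for item in sorted(items, key=lambda it: (-it.y, it.x)):
        # Start a new row when the Y position moves by more than the tolerance.
        if not rows or abs(rows[-1]["y"] - item.y) > y_tolerance:
            rows.append({"y": item.y, "cells": [[] for _ in ranges]})
        for index, (x_min, x_max) in enumerate(ranges):
            if x_min <= item.x < x_max:
                rows[-1]["cells"][index].append((item.x, item.text))
                break  # assumption: text outside every range is dropped
    return [
        [" ".join(text for _, text in sorted(cell)) for cell in row["cells"]]
        for row in rows
    ]


items = [
    TextItem("Dividends", 40, 700),
    TextItem("1,250.00", 300, 700),
    TextItem("Interest", 40, 686),
    TextItem("87.15", 310, 686.5),
]
table = assign_columns(items, [(30, 200), (280, 400)])
# table == [["Dividends", "1,250.00"], ["Interest", "87.15"]]
```

Because the column a value lands in depends only on the ranges you drew, the same ranges give the same split in the preview and in the export.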

Why this matters for CPA & EA work:

  • Fixes Power Query column misalignment: Power Query's PDF connector frequently places numbers from one column into the wrong cell. Column Range Marking uses pixel-based boundaries, eliminating this guesswork entirely.
  • Fixes Adobe Acrobat merged cells: Adobe's PDF-to-Excel converter merges cells and collapses table structure. This tool produces clean, individual cells with correct column placement.
  • Visual confirmation before export: You can check the column split in the live preview, which closely matches the final export.
  • Handles Non-Standard Grids: Perfect for dense multi-page brokerage statements, trial balances, and comparative income statements.

  Strongly recommended for all multi-column financial documents: brokerage statements, trial balances, comparative income statements, K-1 schedules, and bank statements. This is the most CPA/EA-friendly PDF extraction method available entirely in a browser.

3. Column Alignment Mode Effects

Each marked column has an alignment mode that affects how text within that boundary is captured. Four alignment options are available; we suggest trying different modes and comparing the live Text panel preview to find the best result for your specific document structure.

Strict (Default)
Enforces rigid boundaries. Text exceeding the left or right limits will be truncated to fit.
Left
Anchored to the left edge. Long strings are allowed to bleed past the right boundary.
Center
Aligned by the center point. Best for centered headers or single dates.
Right
Anchored to the right edge. Long strings are allowed to bleed past the left boundary.
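
One way to picture the four modes is the Python sketch below. It is an interpretation of the descriptions above, not the tool's real logic: the capture function and its parameters are hypothetical, Left/Center/Right are modeled as "the anchor point must fall inside the column", and Strict is approximated by assuming evenly spaced characters.

```python
def capture(text, x0, x1, lo, hi, mode="strict"):
    """Return the part of `text` captured by column [lo, hi], or None.

    x0 and x1 are the left and right edges of the text run on the page.
    """
    if mode == "left":
        # Anchor on the left edge; the text may bleed past the right boundary.
        return text if lo <= x0 <= hi else None
    if mode == "right":
        # Anchor on the right edge; the text may bleed past the left boundary.
        return text if lo <= x1 <= hi else None
    if mode == "center":
        # Anchor on the midpoint of the text run.
        return text if lo <= (x0 + x1) / 2 <= hi else None
    if mode == "strict":
        if not text or x1 <= x0:
            return None
        width = (x1 - x0) / len(text)  # assumption: evenly spaced characters
        kept = [
            ch
            for i, ch in enumerate(text)
            if lo <= x0 + i * width and x0 + (i + 1) * width <= hi
        ]
        return "".join(kept) or None
    raise ValueError(f"unknown mode: {mode}")
```

For a right-aligned amount that starts to the left of the marked range, this sketch would truncate it under Strict but keep it whole under Right, which is why comparing modes in the live preview is worthwhile.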

4. Column Data Types (Formatting)

Select the expected data type for each column to apply automatic cleaning and formatting during export:

Str (String)
Default mode. Captures raw text exactly as it appears. Ideal for descriptions and names.
Num (Numeric)
Automatic cleaning. Removes currency symbols ($), commas, and spaces, leaving only the numeric value.
Date
Attempts to harmonize varied date formats into a standard structure for clean Excel import.
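
The kind of cleaning described for Num and Date can be sketched in Python as below. This is a minimal sketch under stated assumptions, not the tool's implementation: clean_number, clean_date, and DATE_FORMATS are hypothetical names, the treatment of parentheses as negative amounts is an assumption, and the list of accepted date formats and the ISO output format are illustrative choices.

```python
import re
from datetime import datetime
from decimal import Decimal, InvalidOperation


def clean_number(raw):
    """Strip $, commas, and spaces; return a Decimal, or None if not numeric."""
    s = raw.strip()
    # Assumption: accounting-style (123.45) means a negative amount.
    negative = s.startswith("(") and s.endswith(")")
    s = re.sub(r"[$,\s()]", "", s)
    try:
        value = Decimal(s)
    except InvalidOperation:
        return None
    return -value if negative else value


# Assumption: an illustrative set of input formats, tried in order.
DATE_FORMATS = ("%m/%d/%Y", "%m/%d/%y", "%Y-%m-%d", "%b %d, %Y", "%B %d, %Y")


def clean_date(raw):
    """Return the date as YYYY-MM-DD, or None if no known format matches."""
    s = raw.strip()
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(s, fmt).date().isoformat()
        except ValueError:
            continue
    return None
```

Values that cannot be parsed come back as None here so that they can be flagged for review instead of being silently changed.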

5. Page Range — Highest Priority Control

Page Range overrides everything

The Page Range (Start / End page) has the highest authority in the extraction pipeline. Only the pages in the specified range are read, and all other settings apply within that range.

6. Advanced Mode — Row-Level Control

Enable Advanced to reveal row filtering controls. Useful for isolating a specific table or removing repetitive noise rows.

Start Row Keyword
Define a keyword (within the specified page range) with match modes like "Starts With" or "Equal". Extraction only begins from the first matching row.
End Row Keyword
Extraction stops at the End keyword. Together with the Start keyword, it isolates specific tables. If the keywords reappear, each Start/End pair is extracted as a separate segment.
Content Filters (Skip Rows)
Specify keywords to exclude specific noisy rows (like repeating headers or footers). Empty rows are always removed automatically.
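
The three row controls can be combined as in the Python sketch below. It is an illustration of the behavior described above, not the tool's code: filter_rows and its parameters are hypothetical, and it assumes that the Start row is kept, the End row is dropped, and skip keywords match anywhere in the row.

```python
def filter_rows(rows, start=None, end=None, skip=(), mode="starts_with"):
    """Keep rows between Start and End keywords, minus skipped and empty rows."""

    def hit(line, keyword):
        return line == keyword if mode == "equal" else line.startswith(keyword)

    kept = []
    active = start is None  # with no Start keyword, extraction begins at once
    for row in rows:
        line = " ".join(cell.strip() for cell in row if cell.strip())
        if not line:
            continue  # empty rows are always removed
        if not active:
            if not hit(line, start):
                continue
            active = True  # assumption: the Start row itself is kept
        elif end is not None and hit(line, end):
            if start is None:
                break  # nothing can reopen the segment
            active = False  # wait for the Start keyword to reappear
            continue  # assumption: the End row itself is dropped
        if any(keyword in line for keyword in skip):
            continue  # content filter: skip noisy rows
        kept.append(row)
    return kept
```

With start="Dividends", end="Total Dividends", and skip=("Page ",), a statement that contains two dividend tables yields both tables back to back, without page footers or total rows.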

7. Supplemental: Local OCR (Scanned Documents)

Use this for scanned paper records or photos of documents. OCR runs entirely in the browser using Tesseract.js. After the initial language pack download (~10 MB), subsequent runs are fully offline.

OCR Limitations — Please Read

  • English only: This tool is pre-configured with the English language pack only. OCR on documents in other languages will produce unreliable results.
  • Accuracy is limited: Tesseract.js is a client-side JavaScript OCR engine. Its recognition accuracy is significantly lower than server-side or AI-powered solutions (e.g. Adobe Acrobat, Google Document AI). Complex layouts, small fonts, low-resolution scans, and handwriting will further reduce accuracy.
  • Performance: Browser-based OCR is much slower than server-side engines. Large or multi-page documents may take considerable time to process.
  • Always verify: Treat OCR output as a draft. Manually review and correct all extracted figures before any professional use.
  • First-run download: Tesseract language data (~10 MB) is cached locally on first use. Internet is only required for this one-time download.

CPA/EA Utility Toolbox

Smart File Identification

EA and CPA professionals often receive client files that have lost their extensions. This tool uses deep binary signature checks to automatically identify 50+ formats and restore the correct file extensions. Processing is 100% local and private.

Supported formats include: .xlsx, .xls, .csv, .numbers, .pdf, .docx, .doc, .pages, .pptx, .key, .eml, .msg, .html, .json, .jpg, .png, .heic, .zip, .mp3, .mp4
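
Signature-based identification generally works by comparing the first bytes of a file with known patterns. The Python sketch below covers a handful of well-known signatures only; guess_extension, SIGNATURES, and the helper are hypothetical names, and the tool's actual rules for its 50+ formats are not shown here. Since .xlsx, .docx, and .pptx are all ZIP containers, the sketch assumes they can be told apart by the folder names inside the archive.

```python
import io
import zipfile

# A few widely documented leading-byte signatures.
SIGNATURES = [
    (b"%PDF-", ".pdf"),
    (b"\x89PNG\r\n\x1a\n", ".png"),
    (b"\xff\xd8\xff", ".jpg"),
]


def _guess_zip_based(data):
    """Distinguish Office Open XML files from plain ZIP archives."""
    try:
        with zipfile.ZipFile(io.BytesIO(data)) as archive:
            names = archive.namelist()
    except zipfile.BadZipFile:
        return ".zip"
    for prefix, ext in (("xl/", ".xlsx"), ("word/", ".docx"), ("ppt/", ".pptx")):
        if any(name.startswith(prefix) for name in names):
            return ext
    return ".zip"


def guess_extension(data: bytes):
    """Return a likely extension for the file content, or None if unknown."""
    for magic, ext in SIGNATURES:
        if data.startswith(magic):
            return ext
    # HEIC files carry an "ftyp" box with a HEIC brand after a 4-byte size.
    if data[4:8] == b"ftyp" and data[8:12] in (b"heic", b"heix"):
        return ".heic"
    if data.startswith(b"PK\x03\x04"):
        return _guess_zip_based(data)
    return None
```

Text formats such as .csv or .json have no fixed signature, so identifying them would need content heuristics that this sketch leaves out.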

Apple HEIC to JPG Converter

Most CPA and EA professionals use Windows, which cannot open Apple HEIC photos without additional codecs installed. This tool converts HEIC to JPG entirely in your browser. No data ever leaves your device.