Founder-led. Built in Sydney.

Pre-labelled synthetic documents for healthcare and insurance extraction

Every document ships with ground truth, bounding boxes and a scanned variant. Built for teams shipping document AI in healthcare and insurance.

Free preview
Same-day delivery
40+ medical document types
Per-row labels: loss runs and SOVs
AU healthcare: NSW postcodes, Medicare format
Deterministic by seed
Ground truth + bounding boxes
Scanned variants for every PDF
Visible synthetic disclaimer on every page
Australian document conventions

Three buyer profiles

ML engineers, procurement leads, data platform teams. Different jobs, same RCA libraries.

ML engineers

Shipping healthcare or insurance extraction to production

You need labelled training data and a regression suite that catches extraction failures before they reach a customer document. RCA libraries give you pre-labelled PDFs, ground truth, bounding boxes, and scanned variants in one bundle.

Procurement & QA leads

Evaluating extraction vendors against a real-world target

You need every vendor to score against the same documents and the same ground truth. The Insurance QA Sprint Pack ships 10 complete submission packs with engineered red flags. Same input, same target, fair comparison.
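
As a rough illustration of "same input, same target", the sketch below scores two vendors against one shared ground truth using field-level exact match. The document IDs, field names, vendor names and the `normalise` rule are all hypothetical, chosen for the example and not taken from the pack's actual schema.

```python
from typing import Dict

# Hypothetical shape: document id -> field name -> value.
GroundTruth = Dict[str, Dict[str, str]]


def normalise(value: str) -> str:
    """Collapse whitespace and case so trivial formatting differences are not counted as errors."""
    return " ".join(value.split()).casefold()


def field_accuracy(truth: GroundTruth, predicted: GroundTruth) -> float:
    """Share of ground-truth fields for which the vendor returned a matching value."""
    total = 0
    correct = 0
    for doc_id, fields in truth.items():
        for name, expected in fields.items():
            total += 1
            got = predicted.get(doc_id, {}).get(name)  # missing field counts as wrong
            if got is not None and normalise(got) == normalise(expected):
                correct += 1
    return correct / total if total else 0.0


# Made-up ground truth and vendor outputs for illustration only.
truth = {
    "pack01_loss_run": {"insured_name": "Example Pty Ltd", "total_incurred": "12500.00"},
    "pack01_sov": {"location_count": "3"},
}
vendor_outputs = {
    "vendor_a": {
        "pack01_loss_run": {"insured_name": "example pty ltd", "total_incurred": "12500.00"},
        "pack01_sov": {"location_count": "3"},
    },
    "vendor_b": {
        "pack01_loss_run": {"insured_name": "Example Pty Ltd", "total_incurred": "1250.00"},
    },
}

# Every vendor is scored against the same truth, so the numbers are directly comparable.
scores = {vendor: field_accuracy(truth, out) for vendor, out in vendor_outputs.items()}
for vendor, score in sorted(scores.items(), key=lambda item: -item[1]):
    print(f"{vendor}: {score:.2f}")
```

Exact match after light normalisation is only one possible metric. Numeric tolerance or per-field weighting may suit some evaluations better.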

Data platform teams

Deploying document extraction inside your own environment

RCA Extract runs as a self-hosted Docker container in your cloud or on-prem. Zero data egress. Customer-managed compute and costs. Your existing RBAC and audit policies apply.

Real samples, not mock-ups

Real pages from the libraries

Pages drawn from the RCA Insurance and Medical libraries. Same generator stack, different document types. Every page ships with ground truth, bounding boxes, a scanned variant, and a visible synthetic disclaimer.

RCA Insurance Library samples: Broker submission email, Loss run report, Statement of values, Policy schedule, First notice of loss
RCA Medical Library samples: Discharge summary, ED assessment, Referral letter, Imaging report, Pathology report
RCA Insurance Library
Complete commercial P&C submission packs. Broker email, loss run, statement of values, policy schedule, certificate of currency, application, FNOL, claim report.
RCA Medical Library
40+ document types across hospital, ED, GP clinic, pathology, imaging and specialist correspondence. NSW conventions throughout.
Ships alongside every PDF
CSV + JSONL ground truth, bboxes.jsonl with labelled fields, manifest, scanned variant.
Request a preview pack

Free. Same-day delivery. Two complete insurance packs or 25 to 35 medical documents, with ground truth and scanned variants.

Why teams pick the RCA libraries

Six properties that matter for QA, training and procurement evaluation.

Real-looking PDFs, real variety

Visually varied across eight style profiles and three template families per document type. Not one template repeated.

Ground truth shipped with every document

CSV and JSONL. Bounding boxes on every labelled field. Insurance documents add per-claim and per-location row entries, each row with its own bbox.
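
A minimal sketch of how per-row bounding boxes in a JSONL file might be used to check an extractor's localisation. The record keys (`doc_id`, `field`, `row`, `bbox`) and the `(x0, y0, x1, y1)` box layout are assumptions for illustration. The real `bboxes.jsonl` schema may differ, so check the README shipped with the pack.

```python
import json
from typing import Dict, Tuple

Box = Tuple[float, float, float, float]  # assumed layout: (x0, y0, x1, y1)

# Two made-up records standing in for lines of a bboxes.jsonl file.
SAMPLE_JSONL = """\
{"doc_id": "loss_run_001", "field": "claim_number", "row": 0, "bbox": [72, 200, 180, 214]}
{"doc_id": "loss_run_001", "field": "claim_number", "row": 1, "bbox": [72, 220, 180, 234]}
"""


def load_boxes(text: str) -> Dict[Tuple[str, str, int], Box]:
    """Index boxes by (document, field, row); row defaults to -1 for non-tabular fields."""
    boxes: Dict[Tuple[str, str, int], Box] = {}
    for line in text.splitlines():
        if not line.strip():
            continue  # tolerate blank lines
        record = json.loads(line)
        key = (record["doc_id"], record["field"], record.get("row", -1))
        boxes[key] = tuple(record["bbox"])
    return boxes


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


truth_boxes = load_boxes(SAMPLE_JSONL)
predicted_box = (72, 200, 180, 207)  # hypothetical extractor output covering half of row 0

row0 = truth_boxes[("loss_run_001", "claim_number", 0)]
row1 = truth_boxes[("loss_run_001", "claim_number", 1)]
print(iou(row0, predicted_box), iou(row1, predicted_box))
```

Because each row carries its own box, a prediction can be matched to the specific claim or location it came from instead of to the table as a whole.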

Scanned variants for the photocopy path

Every clean PDF ships with a rotated, noised, JPEG-compressed scanned variant alongside it. Same ground truth, harder input.

Reproducible by seed

The same seed produces the same PDFs every time. Useful for versioned QA, regression suites, and procurement evaluation.

Safe to share inside your company

Every page carries a visible synthetic disclaimer. No page contains real patient, claimant, broker, or policyholder data. No data agreement required to redistribute internally.

Direct delivery, no third parties

Libraries ship as direct downloads. No third-party file-share processor unless requested. RCA Extract runs as a self-hosted Docker container with zero data egress.

How the libraries are built

A deterministic Python generator, curated case files and phrase banks. No LLM calls in the default pipeline.

01

Curated case files

Hand-authored case archetypes. Phrase banks. Field schemas defined up front.

02

Deterministic generator

A Python pipeline turns a case file plus a seed into a fully-rendered PDF. Same seed, same output, every time.
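
The "same seed, same output" property can be sketched in a few lines. This is not the RCA generator. The phrase bank, case file, `generate` function and reference format are all hypothetical, and the sketch produces text instead of a rendered PDF.

```python
import hashlib
import random

# Hypothetical phrase bank and case file, for illustration only.
PHRASE_BANK = {
    "greeting": ["Dear Underwriter,", "Hi team,", "Good morning,"],
    "request": [
        "Please find the submission attached.",
        "Submission documents enclosed for quoting.",
    ],
}
CASE_FILE = {"doc_type": "broker_submission_email", "insured": "Example Pty Ltd"}


def generate(case: dict, seed: int) -> str:
    """Build document text from a case file using only a seeded, local RNG."""
    rng = random.Random(seed)  # local instance: unaffected by global random state
    lines = [
        rng.choice(PHRASE_BANK["greeting"]),
        rng.choice(PHRASE_BANK["request"]),
        f"Insured: {case['insured']}",
        f"Reference: SYN-{rng.randrange(100000, 1000000)}",
    ]
    return "\n".join(lines)


def digest(text: str) -> str:
    """SHA-256 of the output, handy for comparing runs without diffing content."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


first = generate(CASE_FILE, seed=42)
second = generate(CASE_FILE, seed=42)
print(digest(first) == digest(second))
```

The key design choice is that every source of variation flows through the seeded RNG: no clock reads, no global random state, no network calls. Python's `random` algorithms can change between interpreter versions, so a pipeline like this would also need a pinned Python version to stay byte-stable.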

03

Labels generated alongside

Ground truth, bounding boxes and scanned variants are produced in the same pass as the PDFs.
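
One way to read "same pass" is that a single layout step decides where each field goes, and both the renderer and the label writer consume that result. The sketch below assumes a crude fixed-width text metric and hypothetical names (`PlacedField`, `layout_fields`). It illustrates the idea only and does not describe the actual pipeline.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class PlacedField:
    name: str
    value: str
    bbox: Tuple[float, float, float, float]  # assumed (x0, y0, x1, y1) in points


def layout_fields(
    fields: List[Tuple[str, str]],
    left: float = 72.0,
    top: float = 100.0,
    line_height: float = 16.0,
    char_width: float = 6.0,  # crude fixed-width metric, an assumption for the sketch
) -> List[PlacedField]:
    """Place one field per line and record the box each value occupies."""
    placed = []
    for index, (name, value) in enumerate(fields):
        y0 = top + index * line_height
        x1 = left + len(value) * char_width
        placed.append(PlacedField(name, value, (left, y0, x1, y0 + line_height)))
    return placed


fields = [("policy_number", "SYN-000123"), ("insured_name", "Example Pty Ltd")]
placed = layout_fields(fields)

# The same placement list feeds both outputs, so labels cannot drift from the page.
draw_commands = [(p.bbox[0], p.bbox[1], p.value) for p in placed]
ground_truth = [{"field": p.name, "value": p.value, "bbox": list(p.bbox)} for p in placed]

for entry in ground_truth:
    print(entry)
```

Deriving labels from the layout, not from a second pass over the finished PDF, avoids the alignment errors that post-hoc annotation can introduce.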

04

Library packaging

Each library ships with a manifest, splits, a README and a written synthetic safety statement.

Synthetic by design. Safe to share.

Every document we ship is computer-generated. The names, ABNs, Medicare numbers, addresses, phone numbers, policy numbers, claim numbers and dollar values are all synthetic. Every PDF page carries a visible synthetic disclaimer.

  • Not for clinical, claims, underwriting, regulatory, accounting or legal use.
  • RCA Extract runs as a self-hosted Docker container inside your environment. Zero data egress. Inherits your existing RBAC, audit and access policies.
  • Library deliveries are direct downloads. No third-party file-share processor is involved unless you request one.
  • No real customer or patient data is held, transmitted or stored anywhere in the generator or the libraries.
Synthetic disclaimer on every page
No real customer data anywhere
RCA Extract runs in your own environment
Direct delivery, no third parties

Try the libraries in five minutes

Request the free 2-pack insurance preview or a 25 to 35 document medical review pack.

No credit card. Same-day delivery. Each pack ships with a README_START_HERE.md and a recommended five-minute review path.

Built and supported from Sydney, Australia. More about Root Cause Analytics