Right now, at law firms across the country, attorneys and paralegals are pasting depositions, contracts, medical records, and client correspondence into AI tools. ChatGPT. Claude. Copilot. Google Gemini. They are getting useful work done — draft summaries, clause analysis, research memos — and in the process, they are sending privileged client data to third-party servers with zero documentation of what left their network.

Most of them know this is a problem. They do it anyway, because the productivity gains are too significant to ignore.

That gap — between what firms are doing and what they can prove they did responsibly — is where the risk lives.

The Custody Problem

When an attorney pastes a section of a deposition transcript into a consumer AI tool, several things happen simultaneously. The text leaves the firm's network. It travels to a third-party server. The AI provider may or may not retain it, depending on their terms of service and data handling policies. And no record of this transaction exists on the firm's side.

There is no log of what was sent. No documentation of what client data was exposed. No audit trail that a compliance officer, malpractice carrier, or disciplinary board could review.

This is not a hypothetical concern. This is the daily workflow at a growing number of firms that have adopted AI tools without building controls around them.

The core issue is custody. Attorneys have an ethical obligation to maintain the confidentiality of client information. When privileged data leaves the firm's custody and enters a system the firm does not control, the attorney bears the burden of demonstrating that reasonable precautions were taken. With most consumer AI tools, there is nothing to demonstrate.

38 Types of Sensitive Data Hiding in Every Legal Document

The scale of potential exposure is larger than most firms realize. A single legal document can contain dozens of distinct PII types. When we built the detection engine for our Anonymizer product, we cataloged 38 categories of personally identifiable information that commonly appear in legal documents.

A single deposition transcript can contain a dozen of these categories. A discovery production can contain hundreds. Every time that content enters an AI tool without controls, every one of those data points is potentially exposed.
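
Some of those categories follow predictable formats and can be caught with pattern matching; others, like names and employers, cannot. The sketch below (in Python, with illustrative patterns and invented sample text, not our actual detection engine) shows the pattern-matching end of the problem, and also why patterns alone are not enough.

    import re

    # Illustrative patterns for a few format-based categories. A real detection
    # engine needs many more patterns, plus entity recognition for free-text
    # identifiers such as names.
    PATTERNS = {
        "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
        "PHONE": re.compile(r"\b\d{3}[-.]\d{3}[-.]\d{4}\b"),
        "FEDERAL_CASE_NO": re.compile(r"\b\d:\d{2}-cv-\d{5}\b"),
    }

    def detect_pii(text: str) -> list[dict]:
        """Return every pattern match with its category, value, and position."""
        findings = []
        for category, pattern in PATTERNS.items():
            for match in pattern.finditer(text):
                findings.append(
                    {"type": category, "value": match.group(), "start": match.start()}
                )
        return findings

    sample = "Contact John Smith at 615-555-0147 regarding case 2:24-cv-00775."
    print(detect_pii(sample))
    # [{'type': 'PHONE', 'value': '615-555-0147', 'start': 22},
    #  {'type': 'FEDERAL_CASE_NO', 'value': '2:24-cv-00775', 'start': 50}]
    # Note that "John Smith" is invisible to pattern matching; names require
    # entity recognition, not just regular expressions.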

Why Redaction Fails

The instinct most firms have is redaction. Replace sensitive data with [REDACTED] before sending it to the AI. Problem solved.

Except it is not solved. Redaction creates a different problem: it degrades the AI's output.

When you replace "John Smith, age 47, residing at 1234 Oak Street" with "[REDACTED], age [REDACTED], residing at [REDACTED]," the AI loses context. It cannot reason about relationships between parties. It cannot track which person did what. It cannot distinguish the plaintiff from the defendant when both are labeled [REDACTED]. The output quality drops, sometimes dramatically.
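
To see why, consider a minimal Python sketch with an invented two-party passage. Once every name becomes the same placeholder, a question such as who ran the red light has no answer in the text the AI actually sees.

    passage = (
        "John Smith sued Maria Lopez after the collision on Oak Street. "
        "Maria Lopez claims John Smith ran the red light."
    )

    # Naive redaction: every detected name collapses into the same opaque token.
    for name in ["John Smith", "Maria Lopez"]:
        passage = passage.replace(name, "[REDACTED]")

    print(passage)
    # [REDACTED] sued [REDACTED] after the collision on Oak Street.
    # [REDACTED] claims [REDACTED] ran the red light.
    # Plaintiff and defendant are now indistinguishable, so any analysis that
    # depends on who did what degrades or fails outright.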

This creates a frustrating cycle. The attorney redacts the document. The AI produces mediocre output because it lacks context. The attorney decides redaction is not worth the effort and goes back to pasting raw text. The risk returns.

Synthetic Replacement: A Better Approach

The alternative to redaction is synthetic replacement. Instead of removing data, you replace it with realistic but fabricated data. "John Smith" becomes "Thomas Baker." "1234 Oak Street, Nashville" becomes "5678 Elm Avenue, Memphis." Case number "2:24-cv-00775" becomes "3:25-cv-01192."

The AI receives a document that reads naturally. It can track parties, reason about relationships, and produce high-quality analysis. But none of the real identities are present. The document that reaches the AI server contains zero actual client data.
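
A minimal Python sketch of the idea, assuming a detection step has already produced the entity list. The simple string replacement here is for readability; a production system would work from character offsets and entity types.

    # Assumed output of detection: each real value maps to exactly one synthetic
    # stand-in, so the same person reads as the same person throughout.
    replacements = {
        "John Smith": "Thomas Baker",
        "1234 Oak Street, Nashville": "5678 Elm Avenue, Memphis",
        "2:24-cv-00775": "3:25-cv-01192",
    }

    def apply_synthetic(text: str, mapping: dict[str, str]) -> str:
        """Swap every real value for its synthetic counterpart."""
        for real, synthetic in mapping.items():
            text = text.replace(real, synthetic)
        return text

    original = (
        "John Smith, residing at 1234 Oak Street, Nashville, "
        "is the plaintiff in case 2:24-cv-00775."
    )
    sanitized = apply_synthetic(original, replacements)
    # "Thomas Baker, residing at 5678 Elm Avenue, Memphis,
    #  is the plaintiff in case 3:25-cv-01192."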

After the AI returns its analysis, the synthetic names are mapped back to the originals. The attorney gets the full, accurate output. The AI provider never saw the real data. And the entire transaction — what was detected, what was replaced, what was restored — is logged in an audit trail.
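
Restoration is the same mapping run in reverse over whatever the AI returns. Continuing the sketch above, with a hypothetical line of AI output:

    def restore_originals(ai_output: str, mapping: dict[str, str]) -> str:
        """Map every synthetic value in the AI's output back to the real one."""
        for real, synthetic in mapping.items():
            ai_output = ai_output.replace(synthetic, real)
        return ai_output

    ai_output = "Thomas Baker appears to bear primary liability in 3:25-cv-01192."
    restored = restore_originals(ai_output, replacements)
    # "John Smith appears to bear primary liability in 2:24-cv-00775."

    # Round-trip check: no synthetic value should survive restoration.
    assert not any(synth in restored for synth in replacements.values())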

What a Defensible Workflow Looks Like

A defensible AI workflow for legal documents has five components:

  1. Detection: Automated scanning that identifies every PII element in the document before it leaves the firm's network
  2. Cataloging: A record of exactly what was found — names, addresses, case numbers, medical data — with their locations in the document
  3. Replacement: Synthetic identities that preserve document structure and readability while eliminating real data
  4. Restoration: Verified round-trip replacement after the AI returns its output, with confirmation that every synthetic element was correctly mapped back
  5. Audit trail: A permanent log of the entire transaction — what was detected, what was replaced, what was sent, what was received, and what was restored

That fifth component is the one that matters most for compliance. The audit trail is what you show the malpractice carrier. It is what you present to the disciplinary board. It is the documentation that transforms "we think we handled it properly" into "here is the proof."
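
Put together, the five components form a single controlled path between the document and the AI provider, and the audit entry falls out of the same code path. The sketch below is illustrative, not our implementation: it reuses apply_synthetic and restore_originals from the earlier sketches, and detect_pii, make_synthetic, and send_to_ai are stand-ins for a real detection engine, identity generator, and provider integration.

    from datetime import datetime, timezone

    def process_with_ai(document, detect_pii, make_synthetic, send_to_ai, audit_log):
        """Detection, cataloging, replacement, restoration, and logging, in order."""
        # 1. Detection and 2. cataloging: identify and record every PII element
        # before anything leaves the network.
        entities = detect_pii(document)   # e.g. [{"type": "NAME", "value": "John Smith"}, ...]

        # 3. Replacement: one consistent synthetic identity per real value.
        mapping = {e["value"]: make_synthetic(e) for e in entities}
        sanitized = apply_synthetic(document, mapping)

        # Only the sanitized text ever leaves the firm's network.
        ai_output = send_to_ai(sanitized)

        # 4. Restoration: verified round trip back to the real identities.
        restored = restore_originals(ai_output, mapping)
        assert not any(s in restored for s in mapping.values())

        # 5. Audit trail: a permanent record of the entire transaction.
        audit_log.append({
            "timestamp": datetime.now(timezone.utc).isoformat(),
            "detected": sorted({e["type"] for e in entities}),
            "replaced": len(mapping),
            "chars_sent": len(sanitized),
            "chars_received": len(ai_output),
            "restored": True,
        })
        return restored

Note that the logged entry in this sketch records categories and counts rather than the sensitive values themselves, so the audit trail can be handed to a reviewer without recreating the exposure it exists to prevent.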

The Ethics Landscape Is Shifting

A growing number of state bars are examining how attorneys use AI tools in practice. The emerging consensus is not that AI should be avoided — it is that attorneys must exercise competent oversight of the technology they use, including understanding how client data is handled.

The practical implications are straightforward. Firms that can demonstrate a controlled, documented process for AI use are in a defensible position. Firms that cannot demonstrate any process are exposed — not because they used AI, but because they used it without safeguards.

Malpractice carriers are beginning to ask similar questions. As AI adoption accelerates, the underwriting conversation is shifting from "do you use AI" to "how do you use AI, and what controls are in place." Firms with answers to those questions will be in a stronger position than those without.

"Hope nothing goes wrong" is not a compliance strategy.

The Path Forward

The firms that will benefit most from AI are the ones that adopt it with controls in place from the start. Not the ones that avoid it entirely — that is a competitive disadvantage that compounds over time. And not the ones that use it without guardrails — that is a liability that compounds in a different way.

The practical middle ground is a system that sits between the firm and the AI, handling the sensitive work of stripping and restoring identities so that attorneys can focus on the legal analysis they were trained to do. That is what we built.