Gradion | Engineering

Why you cannot upload client documents to ChatGPT (and how to use AI anyway)

By Ivor Padilla, Co-Founder & Engineering Director at Gradion · 8 min read


The Spanish search chatgpt abogados usually means one practical question: can a firm use a powerful cloud model without exposing client confidentiality? The answer is not "never use AI". The answer is: never send the raw client file upstream. If the PDF contains names, addresses, DNI/NIF/NIE/CIF numbers, reference codes or document metadata, the cloud call is already inside your personal-data perimeter.

The fix is architectural, not rhetorical. OCR runs locally. PII detection runs locally. Reversible anonymisation runs locally. Only then does the cloud LLM see the text, and what it sees is <PERSONA_1>, <ORGANIZATION_1> and <ADDRESS_1>, not the real client data.

TL;DR: a usable law-firm AI stack is not "ChatGPT plus a policy". It is a boundary: (1) local OCR, (2) local Spanish PII detection, (3) local reversible anonymisation, (4) cloud inference on redacted text only, followed by local de-anonymisation. The model gets semantics, not identity.

Why raw client files cannot go to a cloud LLM

Once the real file leaves the firm's environment, the provider is no longer "just a tool". It is part of the processing chain.

If a law firm uses a cloud provider to process personal data on the firm's behalf, Article 28 GDPR applies: that provider is acting as a data processor, and the relationship must be governed by a contract setting out the object, duration, purpose, data categories and each party's obligations. For raw client files, that contract is not paperwork around the edge of the workflow; it is part of the minimum lawful setup.

The AEPD's guidance on GDPR compliance for AI-based processing is explicit: if a data subject's data is sent to third parties in order to run or refine the AI component, that is a disclosure of data and may also involve storage or model-modification processing. Once the real client file is sent upstream, you are no longer "just using a tool" — you are operating a third-party data flow that has to stand up to scrutiny.

That is why the right implementation question is not "which model is best?" but what exact text is allowed to leave the machine?

The pipeline, step by step

The key is not the model choice. It is the order of operations. Each stage hands only what the next stage strictly needs — and no stage sends identifiable text to a remote endpoint.

Installation

pip install langchain langchain-core langchain-openrouter langchain-ollama liteparse python-dotenv
pip install presidio-analyzer presidio-anonymizer spacy pydni
python -m spacy download es_core_news_sm
# Ollama with gemma4:e4b and lightonocr-2-1b loaded

Configuration

from presidio_analyzer import (AnalyzerEngine, RecognizerRegistry,
    EntityRecognizer, RecognizerResult, Pattern, PatternRecognizer)
from presidio_analyzer.nlp_engine import NlpEngineProvider
from presidio_anonymizer import AnonymizerEngine
from langchain.agents import create_agent
from langchain.agents.middleware import PIIMiddleware
from langchain_core.messages import HumanMessage
from liteparse import LiteParse
from langchain_openrouter import ChatOpenRouter
from langchain_ollama import ChatOllama
from pydantic import BaseModel, Field
from PyDNI import verificar_dni, verificar_nie, verificar_cif
import base64, os, re

OLLAMA_URL  = "http://localhost:11434"
OCR_MODEL   = "lightonocr-2-1b"
LLM_MODEL   = "gemma4:e4b"
NOTA_SIMPLE = "document.pdf"  # your PDF here

Step 1 — Local OCR

LiteParse renders the PDF page to an image. That image goes as base64 to lightonocr-2-1b through the local Ollama server — no document leaves the machine.

parser = LiteParse()
screenshot = parser.screenshot(NOTA_SIMPLE, dpi=200, load_bytes=True)
b64 = base64.b64encode(screenshot.get_page(0).image_bytes).decode()

ocr_llm = ChatOllama(
    model=OCR_MODEL,
    base_url=OLLAMA_URL,
    temperature=0,
    num_ctx=16384,
    num_predict=16384,
)

msg = HumanMessage(content=[
    {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{b64}"}},
    {"type": "text", "text": "Transcribe all text in the document. Preserve the original structure."},
])

response = ocr_llm.invoke([msg])
content = response.content
print(f"OCR: {len(content)} characters extracted")

Step 2 — PII detection built for Spain

Presidio coordinates entity recognition across three layers. Each layer does something different; all three are necessary to cover the PII surface of a Spanish legal document.

SpanishIDRecognizer — DNI, NIE and CIF

Most generic guides detect PII with an email regex and a phone regex. For a Spanish law firm, that is not enough. This recogniser targets DNI, NIE and CIF — and does not stop at the pattern: it calls pydni to validate the control digit, so a string that merely resembles a DNI but fails the checksum is not flagged, which keeps false positives out of the redaction map.

class SpanishIDRecognizer(EntityRecognizer):
    def __init__(self):
        super().__init__(supported_entities=["ES_DNI", "ES_NIE", "ES_CIF"], supported_language="es")
        self.dni_re = re.compile(r"\b\d{8}[\s\-]?[A-Z]\b")
        self.nie_re = re.compile(r"\b[XYZ][\s\-]?\d{7}[\s\-]?[A-Z]\b")
        self.cif_re = re.compile(r"\b[A-HJ-NP-SUVW][\.\ -]?\d{2}[\.\ -]?\d{6}\b|\b[A-HJ-NP-SUVW]-?\d{7}-?[A-Z0-9]\b")
    def load(self): pass
    def _clean(self, v): return v.upper().replace("-","").replace(" ","").replace(".","")
    def analyze(self, text, entities, nlp_artifacts=None, regex_flags=None):
        results = []
        for m in self.dni_re.finditer(text):
            if verificar_dni(self._clean(m.group(0))):
                results.append(RecognizerResult("ES_DNI", m.start(), m.end(), 1.0))
        for m in self.nie_re.finditer(text):
            if verificar_nie(self._clean(m.group(0))):
                results.append(RecognizerResult("ES_NIE", m.start(), m.end(), 1.0))
        for m in self.cif_re.finditer(text):
            if verificar_cif(self._clean(m.group(0))):
                results.append(RecognizerResult("ES_CIF", m.start(), m.end(), 1.0))
        return results
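The control-digit check that pydni performs can be sketched in plain Python. This is an inline re-implementation for illustration only (the function names here are not pydni's API): the control letter is the number modulo 23, used as an index into the official letter table, and a NIE maps its leading letter to a digit first.

```python
# Spanish DNI/NIE control-letter validation, re-implemented inline
# for illustration (pydni performs the equivalent check).
LETTERS = "TRWAGMYFPDXBNJZSQVHLCKE"  # official control-letter table

def dni_letter(number: int) -> str:
    # The control letter is the remainder of the number mod 23.
    return LETTERS[number % 23]

def is_valid_dni(value: str) -> bool:
    value = value.upper().replace("-", "").replace(" ", "")
    if len(value) != 9 or not value[:8].isdigit():
        return False
    return dni_letter(int(value[:8])) == value[8]

def is_valid_nie(value: str) -> bool:
    value = value.upper().replace("-", "").replace(" ", "")
    if len(value) != 9 or value[0] not in "XYZ" or not value[1:8].isdigit():
        return False
    # NIE: map the leading letter X/Y/Z to 0/1/2, then apply the DNI rule.
    prefix = "XYZ".index(value[0])
    return dni_letter(int(f"{prefix}{value[1:8]}")) == value[8]

print(is_valid_dni("12345678Z"))  # True: 12345678 % 23 == 14 -> 'Z'
print(is_valid_dni("12345678A"))  # False: wrong control letter
```

This is why the recogniser above assigns a confidence of 1.0: a checksum match is a near-certain hit, unlike a bare regex match.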

PIIEntities — structured output schema

To make the local LLM return structured results rather than free text, we define the Pydantic schema first. Gemma will be required to return exactly these four lists — no free-text parsing, no post-processing.

class PIIEntities(BaseModel):
    personas:       list[str] = Field(default_factory=list, description="Full names of persons")
    organizaciones: list[str] = Field(default_factory=list, description="Company/organization names")
    direcciones:    list[str] = Field(default_factory=list, description="Full postal addresses")
    ubicaciones:    list[str] = Field(default_factory=list, description="Cities, provinces, countries")

GemmaLLMRecognizer — the LLM detects what regex cannot

Structured identifiers (DNI, email, phone) are handled well by regex. Persons, organisations and addresses are a different problem: they depend on context. "María López García" is PII; "Madrid" usually is not. This recogniser calls Gemma on Ollama with with_structured_output(PIIEntities): the LLM returns the four schema categories directly, with no free text to parse.

class GemmaLLMRecognizer(EntityRecognizer):
    LABEL_MAP = {"personas":"PERSON","organizaciones":"ORGANIZATION",
                 "direcciones":"ADDRESS","ubicaciones":"LOCATION"}
    SYSTEM_PROMPT = (
        "Extract all personally identifiable information (PII) from the text. "
        "Only real names of persons, companies, locations and postal addresses. "
        "Do not include legal terms, technical codes or formulas. "
        "Ignore HTML tags and markdown syntax."
    )
    def __init__(self):
        super().__init__(supported_entities=list(set(self.LABEL_MAP.values())), supported_language="es")
        self._cache = {}
        self._llm = None

    def _get_llm(self):
        if self._llm is None:
            self._llm = ChatOllama(
                base_url=OLLAMA_URL,
                model=LLM_MODEL, temperature=0, num_ctx=16384,
            ).with_structured_output(PIIEntities)
        return self._llm

    def load(self): pass

    def _clean_html(self, t):
        t = re.sub(r"<[^>]+>", " ", t)
        t = re.sub(r"!\[.*?\]\(.*?\)", "", t)
        t = re.sub(r"\[.*?\]\(.*?\)", "", t)
        return re.sub(r"\s+", " ", t).strip()

    def _call_llm(self, text):
        key = hash(text)
        if key in self._cache: return self._cache[key]
        try:
            result = self._get_llm().invoke([
                {"role": "system", "content": self.SYSTEM_PROMPT},
                {"role": "user",   "content": self._clean_html(text)},
            ])
            data = result.model_dump()
        except Exception:
            data = {}
        self._cache[key] = data
        return data

    def analyze(self, text, entities, nlp_artifacts=None, regex_flags=None):
        data = self._call_llm(text)
        results, seen = [], set()
        for key, mapped in self.LABEL_MAP.items():
            if mapped not in entities: continue
            for value in data.get(key, []):
                if not isinstance(value, str) or len(value) < 4 or value in seen: continue
                seen.add(value)
                idx = text.find(value)
                if idx != -1:
                    results.append(RecognizerResult(mapped, idx, idx + len(value), 0.85))
        return results

Pattern recognisers and Presidio engine

Four pattern recognisers for email, URL, phone and postcode. RecognizerRegistry collects all six recognisers, and AnalyzerEngine coordinates them on every call to analyzer.analyze().

email_rec = PatternRecognizer(supported_entity="EMAIL_ADDRESS", supported_language="es",
    patterns=[Pattern("email", r"[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}", 0.9)])
url_rec = PatternRecognizer(supported_entity="URL", supported_language="es",
    patterns=[Pattern("url", r"(?:https?://|www\.)[^\s,]+", 0.5)])
phone_rec = PatternRecognizer(supported_entity="PHONE_NUMBER", supported_language="es",
    patterns=[Pattern("phone_es", r"\b(?:\+34[\s.-]?)?(?:\d{3}[\s.-]?\d{3}[\s.-]?\d{3}|\d{2}[\s.-]?\d{3}[\s.-]?\d{2}[\s.-]?\d{2})\b", 0.7)])
cp_rec = PatternRecognizer(supported_entity="POSTAL_CODE", supported_language="es",
    patterns=[Pattern("cp_es", r"\b(?:0[1-9]|[1-4]\d|5[0-2])\d{3}\b", 0.1)],
    context=["CP", "C.P.", "código postal", "Madrid", "Barcelona", "Sevilla", "Valencia"])

registry = RecognizerRegistry(supported_languages=["es"])
registry.add_recognizer(SpanishIDRecognizer())
registry.add_recognizer(GemmaLLMRecognizer())
registry.add_recognizer(email_rec)
registry.add_recognizer(url_rec)
registry.add_recognizer(phone_rec)
registry.add_recognizer(cp_rec)

nlp_config = {"nlp_engine_name": "spacy",
              "models": [{"lang_code": "es", "model_name": "es_core_news_sm"}]}
analyzer = AnalyzerEngine(
    registry=registry,
    nlp_engine=NlpEngineProvider(nlp_configuration=nlp_config).create_engine(),
    supported_languages=["es"],
)

Step 3 — Reversible local anonymisation

Detected entities are replaced with stable, numbered tokens. The key word is reversible: the mapping lives locally, so the model preserves logical relationships in the document while the real values stay on the machine.

class ReversibleAnonymizer:
    """Anonymise with numbered tokens and keep the mapping for reversal."""
    def __init__(self):
        self.mapping   = {}   # token  -> original value
        self._counters = {}   # entity_type -> counter
        self._seen     = {}   # original value -> token

    def _get_token(self, entity_type, original):
        if original in self._seen: return self._seen[original]
        count = self._counters.get(entity_type, 0) + 1
        self._counters[entity_type] = count
        token = f"<{entity_type}_{count}>"
        self.mapping[token] = original
        self._seen[original] = token
        return token

    def anonymize(self, text, analyzer_results):
        # Replace right-to-left so earlier offsets stay valid
        sorted_r = sorted(analyzer_results, key=lambda x: x.start, reverse=True)
        out = text
        for r in sorted_r:
            original = text[r.start:r.end]
            token    = self._get_token(r.entity_type, original)
            out      = out[:r.start] + token + out[r.end:]
        return out

    def deanonymize(self, text):
        out = text
        for token, original in self.mapping.items():
            out = out.replace(token, original)
        return out

rev_anonymizer = ReversibleAnonymizer()

# Detect and anonymise
results         = analyzer.analyze(text=content, language="es", score_threshold=0.25)
anonymized_text = rev_anonymizer.anonymize(content, results)

print(f"Detected {len(results)} entities")
for token, original in rev_anonymizer.mapping.items():
    print(f"  {token} -> {original}")

In practice, with a fragment from a Spanish property document:

Titular: María López García, DNI 12345678Z, actuando en representación de
Bufete Martínez & Asociados S.L. (CIF B58492031), con domicilio en
Calle Gran Vía, 28, 4.º B, 28013 Madrid.
Contacto: m.lopez@bufete-ma.es  ·  +34 612 345 678

The pipeline detects the entities and builds the local mapping:

Detected 7 entities
  <PERSON_1>        -> María López García
  <ES_DNI_1>        -> 12345678Z
  <ORGANIZATION_1>  -> Bufete Martínez & Asociados S.L.
  <ES_CIF_1>        -> B58492031
  <ADDRESS_1>       -> Calle Gran Vía, 28, 4.º B, 28013 Madrid
  <EMAIL_ADDRESS_1> -> m.lopez@bufete-ma.es
  <PHONE_NUMBER_1>  -> +34 612 345 678

That mapping stays local — the provider never sees it. What leaves the machine is:

Titular: <PERSON_1>, DNI <ES_DNI_1>, actuando en representación de
<ORGANIZATION_1> (CIF <ES_CIF_1>), con domicilio en <ADDRESS_1>.
Contacto: <EMAIL_ADDRESS_1>  ·  <PHONE_NUMBER_1>
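Because the tokens are plain strings, they survive the cloud round trip intact and the model's answer can be restored locally. A standalone toy run of the reversal (mirroring the `deanonymize` loop above, with a simulated model answer rather than a real API call):

```python
# Local token -> original mapping (this dict never leaves the machine).
mapping = {
    "<PERSON_1>": "María López García",
    "<ES_CIF_1>": "B58492031",
}

# Simulated cloud answer: the model reasons over tokens, not identities.
llm_answer = "El CIF <ES_CIF_1> aparece asociado a <PERSON_1> como representante."

# Local de-anonymisation: straight string replacement, as in deanonymize().
for token, original in mapping.items():
    llm_answer = llm_answer.replace(token, original)

print(llm_answer)
# -> El CIF B58492031 aparece asociado a María López García como representante.
```

The model can flag that "the CIF belongs to the organisation the person represents" without ever knowing who the person is; the firm sees the answer with real names restored.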

Step 4 — Cloud LLM on redacted text, then local de-anonymisation

Only now does a cloud model get called. PIIMiddleware acts as a final guardrail — if anything slips through before the request, it gets redacted. But the primary boundary is still the local anonymisation above it, not the middleware.

model = ChatOpenRouter(
    model="z-ai/glm-5.1",
    temperature=0.8,
    api_key=os.environ.get("OPENROUTER_API_KEY"),
)

def detect_pii(text: str) -> list[dict]:
    hits = analyzer.analyze(text=text, language="es", score_threshold=0.25)
    return [{"text": text[r.start:r.end], "start": r.start, "end": r.end} for r in hits]

agent = create_agent(
    model=model,
    tools=[],
    middleware=[PIIMiddleware("pii", detector=detect_pii, strategy="redact")],
)

result = agent.invoke({
    "messages": [
        {"role": "system", "content": "Analyse the anonymised document. Identify risks and inconsistencies."},
        {"role": "user",   "content": anonymized_text},
    ]
})

# De-anonymise before returning to the team
llm_response = result["messages"][-1].content
deanonymized = rev_anonymizer.deanonymize(llm_response)
print(deanonymized)

The pipeline does not route around GDPR. It implements it by design.

Article 32 GDPR is not a rule about whether AI is allowed; it is a rule about appropriate technical measures. The text expressly lists pseudonymisation and encryption, and the ability to ensure ongoing confidentiality, integrity, availability and resilience. A design where the cloud model never sees real identifiers is materially closer to that standard than one that sends the raw file upstream.

The same AEPD guidance makes the minimisation principle in Article 5(1)(c) GDPR operational for AI: data must be adequate, relevant and limited to what is necessary, and in AI that translates into anonymisation and pseudonymisation not only when data is disclosed but also in training, in the model itself and in inference. Redacting the text before the LLM call is a direct application of minimisation, not cosmetic compliance.

The AEPD's audit guide for AI processing goes one step further: minimisation criteria must be applied at each stage of the AI component, using masking, separation, abstraction, anonymisation and pseudonymisation. That is precisely the shape of this workflow — four discrete stages, each with its own trust boundary.

Frequently asked questions

Is sending anonymised text to a cloud provider actually defensible?

Far more defensibly than sending a raw file, because the provider no longer sees real identifiers. That does not remove every compliance duty around the provider, but it radically changes what data is actually exposed upstream.

Why is pydni better than a generic regex for Spanish IDs?

Because it validates the control digit. A regex only tells you a string resembles a DNI or NIE; pydni tells you whether it is structurally valid — which is the right threshold for deciding what must be blocked before the cloud call.

Where should the de-anonymisation mapping live?

Locally, under the firm's own controls. The whole design depends on the provider never having access to that lookup table.
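In the simplest case, "under the firm's own controls" can mean a local file with owner-only permissions. A hedged sketch, not a prescription — the filename, functions and layout here are illustrative, and on Windows the permission bits are only partially honoured:

```python
import json
import os

def save_mapping(mapping: dict, path: str = "mapping.json") -> None:
    # Write the token -> original lookup table locally,
    # created with owner-read/write permissions only (0o600).
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
    with os.fdopen(fd, "w", encoding="utf-8") as f:
        json.dump(mapping, f, ensure_ascii=False, indent=2)

def load_mapping(path: str = "mapping.json") -> dict:
    with open(path, encoding="utf-8") as f:
        return json.load(f)

save_mapping({"<PERSON_1>": "María López García"}, "mapping.json")
print(load_mapping("mapping.json")["<PERSON_1>"])  # María López García
```

A real deployment would add per-matter scoping, encryption at rest and an audit log around reads — but whatever the storage, the invariant is the same: the lookup table is readable only inside the firm's environment.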

Does PIIMiddleware replace local redaction?

No. It is a useful guardrail layer but it is not the primary trust boundary. The primary boundary is local redaction before the cloud request exists at all. If you rely on middleware to clean up a raw prompt at the edge, you are trusting the last checkpoint to do the work that should have been done several steps earlier.

How Gradion implements this

The first deliverable in this kind of project is not model selection. It is a field-level boundary: which identifiers can never leave the local environment, which recognisers are needed for the firm's document mix, where the reversible mapping lives, and what logs the firm needs around the handoff point.

In the projects we have delivered so far, the first blocker is rarely OCR or prompting. It is the missing boundary between identifiable text and model-facing text. The firm does not usually need "more AI". It needs a better-designed line between the real client file and the reasoning layer. When that line exists, the team keeps reviewing, deciding and signing. The AI removes the repetitive document work. It does not replace judgement.

Is your team spending hours on manual document work? Tell us about your workflow →