PII Detection and Masking in LLM Pipelines for Production

How to detect and mask PII before it reaches your LLM and leaks in responses. Covers Microsoft Presidio, regex detection for Indian data (Aadhaar, PAN), token-based masking, and audit logging.

Every DevOps team building LLM features eventually hits this problem: someone sends a support ticket with a user's email address, a log contains an API key, or a prompt includes a customer's phone number — and all of that gets sent to an external LLM provider and potentially logged on their servers.

PII leakage in LLM pipelines is a compliance problem (GDPR, DPDPA), a security risk, and a trust issue with your users. Here's how to handle it properly in production Python.

What PII Looks Like in DevOps Contexts

It's not just names and emails. In DevOps and platform engineering contexts, PII appears as:

User emails in error logs forwarded to an LLM for analysis
Phone numbers in support tickets fed to an AI classifier
Aadhaar or PAN numbers in Indian fintech support data
API keys or tokens in prompts asking "why did my request fail?"
IP addresses in log summarization pipelines
Employee names in incident reports analyzed by an AI

The goal: detect these before the LLM call, mask them with reversible tokens, unmask them in the response, and log what was masked for audit.

Option 1: Microsoft Presidio (Full NLP-Based Detection)

Presidio is an open-source library from Microsoft that combines spaCy NER models with pattern- and checksum-based recognizers to detect a wide range of PII entity types. Best for general-purpose text.

bash

pip install presidio-analyzer presidio-anonymizer
python -m spacy download en_core_web_lg

python

from presidio_analyzer import AnalyzerEngine
from presidio_anonymizer import AnonymizerEngine
from presidio_anonymizer.entities import OperatorConfig
 
analyzer = AnalyzerEngine()
anonymizer = AnonymizerEngine()
 
def detect_and_mask_presidio(text: str) -> tuple[str, list]:
    results = analyzer.analyze(
        text=text,
        language="en",
        entities=["EMAIL_ADDRESS", "PHONE_NUMBER", "PERSON", "IP_ADDRESS", "CREDIT_CARD"]
    )
 
    if not results:
        return text, []
 
    anonymized = anonymizer.anonymize(
        text=text,
        analyzer_results=results,
        operators={
            "EMAIL_ADDRESS": OperatorConfig("replace", {"new_value": "<EMAIL>"}),
            "PHONE_NUMBER": OperatorConfig("replace", {"new_value": "<PHONE>"}),
            "PERSON": OperatorConfig("replace", {"new_value": "<PERSON>"}),
            "IP_ADDRESS": OperatorConfig("replace", {"new_value": "<IP>"}),
            "CREDIT_CARD": OperatorConfig("replace", {"new_value": "<CREDIT_CARD>"}),
        }
    )
 
    return anonymized.text, results
 
text = "User john.doe@example.com called from +91-9876543210 about order #1234"
masked, entities = detect_and_mask_presidio(text)
print(masked)
# "User <EMAIL> called from <PHONE> about order #1234"

Presidio catches entities that fixed patterns miss, such as names, but it is comparatively slow: expect roughly 50–200ms per call depending on text length and model size. It is not suitable for high-throughput real-time paths without caching or batching.
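One way to take the edge off that latency is to cache detection results for text the pipeline has already seen, such as repeated log lines or templated tickets. The sketch below is an illustration under assumptions, not part of Presidio: `CachedDetector` and `detect_fn` are hypothetical names, the cache is keyed on a SHA-256 digest so the raw text is not kept as a key, and it stores only span offsets and entity types rather than the matched values.

```python
import hashlib
from collections import OrderedDict
from typing import Callable

Span = tuple[int, int, str]  # (start, end, entity_type)


class CachedDetector:
    """Bounded LRU cache in front of a slow PII detector (hypothetical helper)."""

    def __init__(self, detect_fn: Callable[[str], list[Span]], max_entries: int = 1024):
        self._detect_fn = detect_fn
        self._max_entries = max_entries
        self._cache: OrderedDict[str, list[Span]] = OrderedDict()

    def detect(self, text: str) -> list[Span]:
        # Key on a digest so the cache never holds the raw text as a key.
        key = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if key in self._cache:
            self._cache.move_to_end(key)  # mark as most recently used
            return list(self._cache[key])
        spans = self._detect_fn(text)
        self._cache[key] = list(spans)
        if len(self._cache) > self._max_entries:
            self._cache.popitem(last=False)  # evict the least recently used entry
        return list(spans)


# Possible wiring with the `analyzer` defined above (assumption, shown as a comment):
# detector = CachedDetector(
#     lambda t: [(r.start, r.end, r.entity_type)
#                for r in analyzer.analyze(text=t, language="en")]
# )
```

Caching only helps when inputs repeat exactly; for free-form text, batching or moving detection off the request path is likely the better lever.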

Option 2: Regex-Based Fast Detection for Known Patterns

For high-volume pipelines where you know the PII patterns, regex is 10–100x faster than NLP and works without downloading spaCy models.

python

import re
import uuid
 
# Pattern library
PII_PATTERNS = {
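    # Note: [A-Za-z], not [A-Z|a-z]; a "|" inside a character class matches a literal pipe
    "EMAIL": re.compile(r'\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b'),
    # Lookarounds stop the phone pattern from matching inside longer digit runs (e.g. an unspaced Aadhaar)
    "PHONE_IN": re.compile(r'(?<!\d)(?:\+91[\-\s]?)?[6-9]\d{9}(?!\d)'),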
    "AADHAAR": re.compile(r'\b[2-9]\d{3}\s?\d{4}\s?\d{4}\b'),
    "PAN": re.compile(r'\b[A-Z]{5}[0-9]{4}[A-Z]{1}\b'),
    "IP_V4": re.compile(r'\b(?:\d{1,3}\.){3}\d{1,3}\b'),
    "AWS_KEY": re.compile(r'\b(AKIA|ASIA)[0-9A-Z]{16}\b'),
    "GENERIC_TOKEN": re.compile(r'\b(sk-|ghp_|glpat-|Bearer\s)[A-Za-z0-9_\-]{20,}\b'),
}
 
def mask_with_tokens(text: str) -> tuple[str, dict]:
    """
    Returns masked text and a mapping of token -> original value
    for unmasking after LLM response.
    """
    token_map = {}
    masked = text
 
    for pii_type, pattern in PII_PATTERNS.items():
        def replace_match(m, pii_type=pii_type):
            token = f"[{pii_type}_{uuid.uuid4().hex[:8].upper()}]"
            token_map[token] = m.group(0)
            return token
 
        masked = pattern.sub(replace_match, masked)
 
    return masked, token_map
 
def unmask_response(response: str, token_map: dict) -> str:
    """Replace tokens in LLM response with original values."""
    for token, original in token_map.items():
        response = response.replace(token, original)
    return response

Test it:

python

text = "User ABCDE1234F called from +919876543210 with key AKIAIOSFODNN7EXAMPLE"
masked, tokens = mask_with_tokens(text)
print(masked)
# e.g. "User [PAN_A1B2C3D4] called from [PHONE_IN_E5F60718] with key [AWS_KEY_9A0B1C2D]"
# (the 8-character suffixes are random hex, so they differ on every run)
print(tokens)
# e.g. {'[PHONE_IN_E5F60718]': '+919876543210', '[PAN_A1B2C3D4]': 'ABCDE1234F', ...}
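One limitation of minting a fresh token per match is that the same value appearing twice gets two unrelated tokens, so the LLM cannot tell that both refer to the same user. A possible variant, sketched here with the hypothetical name `mask_with_stable_tokens` and only two patterns for brevity, reuses one token per distinct value within a single call:

```python
import re
import uuid

# Two illustrative patterns; in practice you would pass in the full pattern dict.
PATTERNS = {
    "EMAIL": re.compile(r"\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b"),
    "PAN": re.compile(r"\b[A-Z]{5}[0-9]{4}[A-Z]\b"),
}


def mask_with_stable_tokens(text: str, patterns=PATTERNS) -> tuple[str, dict]:
    """Mask PII so that repeated values share one token within this call."""
    token_map: dict[str, str] = {}                    # token -> original value
    value_to_token: dict[tuple[str, str], str] = {}   # (type, value) -> token

    masked = text
    for pii_type, pattern in patterns.items():
        def replace_match(m, pii_type=pii_type):
            key = (pii_type, m.group(0))
            if key not in value_to_token:
                token = f"[{pii_type}_{uuid.uuid4().hex[:8].upper()}]"
                value_to_token[key] = token
                token_map[token] = m.group(0)
            return value_to_token[key]

        masked = pattern.sub(replace_match, masked)

    return masked, token_map
```

The trade-off is that stable tokens reveal which mentions are identical; that is usually what you want for log analysis, but it is a design choice rather than a requirement.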

Combining Both Approaches

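Use regex first (fast, catches known patterns), then Presidio for anything that slipped through. Note that the Presidio pass substitutes fixed placeholders such as <PERSON>, so only the regex tokens can be restored after the LLM call: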

python

def sanitize_for_llm(text: str) -> tuple[str, dict, list]:
    # Pass 1: fast regex masking
    masked, token_map = mask_with_tokens(text)
 
    # Pass 2: NLP for PERSON names and other entities regex can't catch
    nlp_masked, entities = detect_and_mask_presidio(masked)
 
    return nlp_masked, token_map, entities
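If names detected in pass 2 also need to be restored in the response, the analyzer's spans can be turned into tokens instead of fixed placeholders. The sketch below relies only on the `start`, `end`, and `entity_type` attributes that Presidio's `RecognizerResult` exposes; `mask_spans_reversibly` and the `Span` stand-in are hypothetical names used so the example runs without Presidio installed.

```python
import uuid
from dataclasses import dataclass


@dataclass
class Span:
    """Stand-in for presidio_analyzer.RecognizerResult (same attribute names)."""
    entity_type: str
    start: int
    end: int


def mask_spans_reversibly(text: str, spans, token_map: dict) -> str:
    """Replace each detected span with a unique token and record it in token_map."""
    # Merge overlapping spans first so no part of a detection is left unmasked.
    merged: list[list] = []
    for s in sorted(spans, key=lambda s: (s.start, -s.end)):
        if merged and s.start < merged[-1][1]:
            merged[-1][1] = max(merged[-1][1], s.end)
        else:
            merged.append([s.start, s.end, s.entity_type])

    masked = text
    # Replace right to left so earlier offsets stay valid after each substitution.
    for start, end, entity_type in reversed(merged):
        token = f"[{entity_type}_{uuid.uuid4().hex[:8].upper()}]"
        token_map[token] = text[start:end]
        masked = masked[:start] + token + masked[end:]
    return masked
```

With Presidio, the spans would come from something like `analyzer.analyze(text=masked, language="en", entities=["PERSON"])`, and passing in the `token_map` returned by `mask_with_tokens` lets `unmask_response` restore both kinds of token.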

Full Pipeline: Mask → LLM → Unmask

python

from datetime import datetime, timezone

import anthropic

client = anthropic.Anthropic()

def llm_call_with_pii_protection(user_prompt: str) -> str:
    # Step 1: sanitize
    safe_prompt, token_map, entities = sanitize_for_llm(user_prompt)

    # Step 2: audit log what was masked (token keys and entity types only)
    if token_map or entities:
        audit_log({
            "original_length": len(user_prompt),
            "masked_tokens": list(token_map.keys()),
            "presidio_entities": [e.entity_type for e in entities],
            # Timezone-aware UTC timestamp; datetime.utcnow() is deprecated since Python 3.12
            "timestamp": datetime.now(timezone.utc).isoformat()
        })
 
    # Step 3: LLM call with sanitized prompt
    response = client.messages.create(
        model="claude-haiku-4-5-20251001",
        max_tokens=1024,
        messages=[{"role": "user", "content": safe_prompt}]
    )
 
    raw_response = response.content[0].text
 
    # Step 4: unmask if needed (token references in LLM response)
    final_response = unmask_response(raw_response, token_map)
 
    return final_response
 
def audit_log(entry: dict):
    # Write to your audit system — S3, CloudWatch, or a database
    # NEVER log the original values, only the token keys
    print(f"[AUDIT] PII masked: {entry}")
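Unmasking depends on the model echoing tokens verbatim, and a model may drop one, alter it, or produce a token-shaped string that was never issued. A minimal sketch of a check for that, assuming the `[TYPE_XXXXXXXX]` token shape generated by `mask_with_tokens` (the names `TOKEN_RE` and `unmask_and_check` are hypothetical):

```python
import re

# Matches the [TYPE_8HEX] shape produced by mask_with_tokens in this article.
TOKEN_RE = re.compile(r"\[[A-Z0-9_]+_[0-9A-F]{8}\]")


def unmask_and_check(response: str, token_map: dict) -> tuple[str, list[str], list[str]]:
    """Unmask a response and report tokens that were unused or unrecognized."""
    # Tokens we issued that the model never echoed back.
    unused = [t for t in token_map if t not in response]
    # Token-shaped strings in the response that we never issued.
    unknown = [t for t in TOKEN_RE.findall(response) if t not in token_map]

    for token, original in token_map.items():
        response = response.replace(token, original)
    return response, unused, unknown
```

Unused tokens are often harmless (the model simply did not mention that value), while unknown tokens are worth logging or rejecting, since they would otherwise reach the user as raw placeholders.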

Testing PII Detection Coverage

Write tests for your patterns before going to production:

python

import pytest

# Assumes mask_with_tokens from the regex section is importable in this test module.
test_cases = [
    ("Call me at +91-9876543210", True, "PHONE_IN"),
    ("My PAN is ABCDE1234F", True, "PAN"),
    ("Email: user@example.com", True, "EMAIL"),
    ("AWS key: AKIAIOSFODNN7EXAMPLE", True, "AWS_KEY"),
    ("No PII here, just DevOps talk", False, None),
]

# One test per case, so a failure report names the exact input that broke
@pytest.mark.parametrize("text, should_detect, expected_type", test_cases)
def test_pii_detection(text, should_detect, expected_type):
    masked, token_map = mask_with_tokens(text)
    if should_detect:
        assert token_map, f"Should have detected PII in: {text}"
        assert any(expected_type in k for k in token_map.keys()), \
            f"Expected {expected_type} in {token_map.keys()}"
    else:
        assert not token_map, f"False positive in: {text}"
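Positive cases alone say little about false positives, which is where regex detection tends to go wrong. As an illustration, a phone pattern with no digit boundaries will match inside longer numeric IDs, while a variant with lookarounds will not. Both pattern names and the sample strings below are made up for this sketch:

```python
import re

# No digit boundaries: matches any 10-digit run starting with 6-9, even inside longer numbers.
LOOSE_PHONE = re.compile(r"(\+91[\-\s]?)?[6-9]\d{9}")
# Lookarounds require that the match is not embedded in a longer digit run.
STRICT_PHONE = re.compile(r"(?<!\d)(?:\+91[\-\s]?)?[6-9]\d{9}(?!\d)")

NEGATIVE_CASES = [
    "Order 202409876543210 shipped",       # 15-digit order ID
    "trace_id=1698765432100 latency=12",   # epoch timestamp in milliseconds
]
POSITIVE_CASES = [
    "Call +91-9876543210",
    "Call 9876543210 now",
]


def false_positives(pattern: re.Pattern, cases: list[str]) -> list[str]:
    """Return the negative cases that the pattern wrongly matches."""
    return [c for c in cases if pattern.search(c)]


def misses(pattern: re.Pattern, cases: list[str]) -> list[str]:
    """Return the positive cases that the pattern fails to match."""
    return [c for c in cases if not pattern.search(c)]
```

Running both helpers against a sample of real (already anonymized) logs before deployment gives a rough picture of how much each pattern over- or under-matches.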

Key Rules for Production

Never log the original PII values — only log that masking occurred and which types
Keep the token map in memory only for the duration of a single request; never persist it
Test your patterns against real data samples before deploying — edge cases exist
For GDPR/DPDPA compliance, document your masking pipeline in your data processing records
Presidio + regex gives you defense-in-depth; neither alone is sufficient
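The first two rules can be enforced structurally rather than by convention. The sketch below, with the hypothetical names `request_token_map` and `audit_summary`, scopes the token map to a `with` block and builds an audit entry that contains only counts per PII type. It assumes the `[TYPE_XXXXXXXX]` token shape used earlier.

```python
from collections import Counter
from contextlib import contextmanager


@contextmanager
def request_token_map():
    """Yield a token map that is emptied when the request finishes, even on error."""
    token_map: dict[str, str] = {}
    try:
        yield token_map
    finally:
        # Drops the references; Python may still hold the strings until garbage collection.
        token_map.clear()


def audit_summary(token_map: dict) -> dict:
    """Summarize what was masked by PII type; never includes original values."""
    # Tokens look like "[PAN_A1B2C3D4]"; the type is everything before the last "_".
    types = Counter(token.strip("[]").rsplit("_", 1)[0] for token in token_map)
    return {"masked_count": len(token_map), "by_type": dict(types)}
```

Inside the `with` block you would mask, call the LLM, unmask, and write `audit_summary(token_map)` to the audit log; nothing that maps tokens back to values leaves the block.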

The cost of retrofitting PII protection after a data leak is orders of magnitude higher than building it in from day one.
