
PII Detection and Masking in LLM Pipelines for Production

How to detect and mask PII before it reaches your LLM and leaks in responses. Covers Microsoft Presidio, regex detection for Indian data (Aadhaar, PAN), token-based masking, and audit logging.

DevOpsBoys · 5 min read

Every DevOps team building LLM features eventually hits this problem: someone sends a support ticket with a user's email address, a log contains an API key, or a prompt includes a customer's phone number — and all of that gets sent to an external LLM provider and potentially logged on their servers.

PII leakage in LLM pipelines is a compliance problem (GDPR, DPDPA), a security risk, and a trust issue with your users. Here's how to handle it properly in production Python.

What PII Looks Like in DevOps Contexts

It's not just names and emails. In DevOps and platform engineering contexts, PII and other sensitive data (such as credentials) show up as:

  • User emails in error logs forwarded to an LLM for analysis
  • Phone numbers in support tickets fed to an AI classifier
  • Aadhaar or PAN numbers in Indian fintech support data
  • API keys or tokens in prompts asking "why did my request fail?"
  • IP addresses in log summarization pipelines
  • Employee names in incident reports analyzed by an AI

The goal: detect these before the LLM call, mask them with reversible tokens, unmask them in the response, and log what was masked for audit.

Option 1: Microsoft Presidio (Full NLP-Based Detection)

Presidio is an open-source library from Microsoft that combines spaCy NER models with regex and checksum-based recognizers to detect 50+ PII entity types. It is the better fit for free-form, general-purpose text.

bash
pip install presidio-analyzer presidio-anonymizer
python -m spacy download en_core_web_lg

python
from presidio_analyzer import AnalyzerEngine
from presidio_anonymizer import AnonymizerEngine
from presidio_anonymizer.entities import OperatorConfig
 
analyzer = AnalyzerEngine()
anonymizer = AnonymizerEngine()
 
def detect_and_mask_presidio(text: str) -> tuple[str, list]:
    results = analyzer.analyze(
        text=text,
        language="en",
        entities=["EMAIL_ADDRESS", "PHONE_NUMBER", "PERSON", "IP_ADDRESS", "CREDIT_CARD"]
    )
 
    if not results:
        return text, []
 
    anonymized = anonymizer.anonymize(
        text=text,
        analyzer_results=results,
        operators={
            "EMAIL_ADDRESS": OperatorConfig("replace", {"new_value": "<EMAIL>"}),
            "PHONE_NUMBER": OperatorConfig("replace", {"new_value": "<PHONE>"}),
            "PERSON": OperatorConfig("replace", {"new_value": "<PERSON>"}),
            "IP_ADDRESS": OperatorConfig("replace", {"new_value": "<IP>"}),
            "CREDIT_CARD": OperatorConfig("replace", {"new_value": "<CREDIT_CARD>"}),
        }
    )
 
    return anonymized.text, results
 
text = "User john.doe@example.com called from +91-9876543210 about order #1234"
masked, entities = detect_and_mask_presidio(text)
print(masked)
# "User <EMAIL> called from <PHONE> about order #1234"

Presidio is accurate but comparatively slow: expect roughly 50–200 ms per call, depending on text length and model size. On high-throughput real-time paths it needs caching or batching in front of it.
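One way to approach the caching mentioned above is a small LRU cache in front of the masking function. The sketch below is an assumption, not part of Presidio: `MaskCache` and its parameters are hypothetical names, and it only makes sense for deterministic masking with fixed placeholders like `<EMAIL>`, not for per-request random tokens. It keys entries on a SHA-256 digest so the raw input text is not retained as a cache key.

```python
import hashlib
from collections import OrderedDict
from typing import Callable


class MaskCache:
    """LRU cache for a deterministic masking function (text -> masked text)."""

    def __init__(self, mask_fn: Callable[[str], str], max_entries: int = 1024):
        self._mask_fn = mask_fn
        self._max_entries = max_entries
        self._store = OrderedDict()  # digest -> masked text
        self.hits = 0
        self.misses = 0

    def __call__(self, text: str) -> str:
        # Key on a digest so the unmasked text is not kept in the cache.
        key = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if key in self._store:
            self._store.move_to_end(key)  # mark as most recently used
            self.hits += 1
            return self._store[key]
        self.misses += 1
        masked = self._mask_fn(text)
        self._store[key] = masked
        if len(self._store) > self._max_entries:
            self._store.popitem(last=False)  # evict least recently used
        return masked


# Possible wiring with the function defined earlier (keeps only the masked text):
# cached_mask = MaskCache(lambda t: detect_and_mask_presidio(t)[0])
```

This tends to pay off in log pipelines, where the same line repeats many times. Only masked output is stored, but the cache still lives beyond a single request, so size it deliberately.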

Option 2: Regex-Based Fast Detection for Known Patterns

For high-volume pipelines where you know the PII patterns in advance, regex is typically 10–100x faster than NLP-based detection and needs no spaCy model download.

python
import re
import uuid
 
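# Pattern library. Patterns are applied in this order by mask_with_tokens.
PII_PATTERNS = {
    "EMAIL": re.compile(r'\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b'),
    # Indian mobile number with optional +91/91 prefix; the lookarounds keep it
    # from matching inside longer digit runs such as an unspaced Aadhaar number
    "PHONE_IN": re.compile(r'(?<!\d)(?:\+?91[\-\s]?)?[6-9]\d{9}(?!\d)'),
    # 12 digits, first digit 2-9, optionally grouped 4-4-4
    "AADHAAR": re.compile(r'\b[2-9]\d{3}\s?\d{4}\s?\d{4}\b'),
    # 5 letters, 4 digits, 1 letter
    "PAN": re.compile(r'\b[A-Z]{5}[0-9]{4}[A-Z]\b'),
    "IP_V4": re.compile(r'\b(?:\d{1,3}\.){3}\d{1,3}\b'),
    "AWS_KEY": re.compile(r'\b(?:AKIA|ASIA)[0-9A-Z]{16}\b'),
    "GENERIC_TOKEN": re.compile(r'\b(?:sk-|ghp_|glpat-|Bearer\s)[A-Za-z0-9_\-]{20,}\b'),
}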
 
def mask_with_tokens(text: str) -> tuple[str, dict]:
    """
    Returns masked text and a mapping of token -> original value
    for unmasking after LLM response.
    """
    token_map = {}
    masked = text
 
    for pii_type, pattern in PII_PATTERNS.items():
        def replace_match(m, pii_type=pii_type):
            token = f"[{pii_type}_{uuid.uuid4().hex[:8].upper()}]"
            token_map[token] = m.group(0)
            return token
 
        masked = pattern.sub(replace_match, masked)
 
    return masked, token_map
 
def unmask_response(response: str, token_map: dict) -> str:
    """Replace tokens in LLM response with original values."""
    for token, original in token_map.items():
        response = response.replace(token, original)
    return response

Test it:

python
text = "User ABCDE1234F called from +919876543210 with key AKIAIOSFODNN7EXAMPLE"
masked, tokens = mask_with_tokens(text)
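print(masked)
# e.g. "User [PAN_A1B2C3D4] called from [PHONE_IN_E5F6A7B8] with key [AWS_KEY_09C1D2E3]"
# (the 8-character suffixes are random hex and differ on every run)
print(tokens)
# e.g. {'[PHONE_IN_E5F6A7B8]': '+919876543210', '[PAN_A1B2C3D4]': 'ABCDE1234F', '[AWS_KEY_09C1D2E3]': 'AKIAIOSFODNN7EXAMPLE'}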
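The Aadhaar pattern above matches any 12-digit number that starts with 2–9, which includes plenty of order IDs and reference numbers. Aadhaar numbers end in a Verhoeff check digit, so a checksum pass can cut false positives. The following is a minimal sketch under that assumption; `verhoeff_valid` and `find_aadhaar` are illustrative names, not part of any library used in this article.

```python
import re

# Verhoeff multiplication table (dihedral group D5) and permutation table.
_D = (
    (0, 1, 2, 3, 4, 5, 6, 7, 8, 9),
    (1, 2, 3, 4, 0, 6, 7, 8, 9, 5),
    (2, 3, 4, 0, 1, 7, 8, 9, 5, 6),
    (3, 4, 0, 1, 2, 8, 9, 5, 6, 7),
    (4, 0, 1, 2, 3, 9, 5, 6, 7, 8),
    (5, 9, 8, 7, 6, 0, 4, 3, 2, 1),
    (6, 5, 9, 8, 7, 1, 0, 4, 3, 2),
    (7, 6, 5, 9, 8, 2, 1, 0, 4, 3),
    (8, 7, 6, 5, 9, 3, 2, 1, 0, 4),
    (9, 8, 7, 6, 5, 4, 3, 2, 1, 0),
)
_P = (
    (0, 1, 2, 3, 4, 5, 6, 7, 8, 9),
    (1, 5, 7, 6, 2, 8, 3, 0, 9, 4),
    (5, 8, 0, 3, 7, 9, 6, 1, 4, 2),
    (8, 9, 1, 6, 0, 4, 3, 5, 2, 7),
    (9, 4, 5, 3, 1, 2, 6, 8, 7, 0),
    (4, 2, 8, 6, 5, 7, 3, 9, 0, 1),
    (2, 7, 9, 3, 8, 0, 6, 4, 1, 5),
    (7, 0, 4, 6, 9, 1, 3, 2, 5, 8),
)

AADHAAR_CANDIDATE = re.compile(r'\b[2-9]\d{3}\s?\d{4}\s?\d{4}\b')


def verhoeff_valid(digits: str) -> bool:
    """Return True if the digit string (check digit last) passes Verhoeff."""
    if not digits.isdigit():
        return False
    c = 0
    # Process digits right to left, starting with the check digit.
    for i, ch in enumerate(reversed(digits)):
        c = _D[c][_P[i % 8][int(ch)]]
    return c == 0


def find_aadhaar(text: str) -> list[str]:
    """Return regex candidates that also pass the checksum."""
    hits = []
    for m in AADHAAR_CANDIDATE.finditer(text):
        digits = re.sub(r'\s', '', m.group(0))
        if verhoeff_valid(digits):
            hits.append(m.group(0))
    return hits
```

Whether to skip masking for candidates that fail the checksum is a policy choice: dropping them reduces noise, while masking them anyway is the more conservative option for typo'd numbers.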

Combining Both Approaches

Use regex first (fast, catches known patterns), then Presidio for anything that slipped through:

python
def sanitize_for_llm(text: str) -> tuple[str, dict, list]:
    # Pass 1: fast regex masking
    masked, token_map = mask_with_tokens(text)
 
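    # Pass 2: NLP for PERSON names and other entities regex can't catch.
    # These are replaced with fixed placeholders such as <PERSON>, which are
    # not added to token_map and therefore stay masked after unmasking.
    nlp_masked, entities = detect_and_mask_presidio(masked)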
 
    return nlp_masked, token_map, entities
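If NLP-detected entities also need to be restored in the response, the detection spans can be turned into the same kind of reversible tokens instead of fixed placeholders. The sketch below is one possible approach, not Presidio's own anonymizer: `Span` is a hypothetical stand-in with the `entity_type`, `start`, and `end` attributes that Presidio's `RecognizerResult` objects also expose, and `mask_spans_reversibly` is an illustrative name.

```python
import uuid
from dataclasses import dataclass


@dataclass
class Span:
    """Stand-in for a detection result (entity type plus character offsets)."""
    entity_type: str
    start: int
    end: int


def _merge_overlaps(spans) -> list:
    """Merge overlapping spans so no part of a detection is left unmasked."""
    merged = []
    for s in sorted(spans, key=lambda s: (s.start, -s.end)):
        if merged and s.start < merged[-1].end:
            merged[-1].end = max(merged[-1].end, s.end)
        else:
            merged.append(Span(s.entity_type, s.start, s.end))
    return merged


def mask_spans_reversibly(text: str, spans, token_map: dict) -> str:
    """Replace each span with a unique token and record it in token_map."""
    # Replace right to left so earlier offsets stay valid.
    for span in reversed(_merge_overlaps(spans)):
        token = f"[{span.entity_type}_{uuid.uuid4().hex[:8].upper()}]"
        token_map[token] = text[span.start:span.end]
        text = text[:span.start] + token + text[span.end:]
    return text


# Possible use after the regex pass, reusing its token_map:
# results = analyzer.analyze(text=masked, language="en", entities=["PERSON"])
# masked = mask_spans_reversibly(masked, results, token_map)
```

Because the tokens follow the same `[TYPE_XXXXXXXX]` shape as the regex pass, the existing `unmask_response` can restore them without changes.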

Full Pipeline: Mask → LLM → Unmask

python
from datetime import datetime, timezone

import anthropic

client = anthropic.Anthropic()

def llm_call_with_pii_protection(user_prompt: str) -> str:
    # Step 1: sanitize
    safe_prompt, token_map, entities = sanitize_for_llm(user_prompt)

    # Step 2: audit log what was masked (token keys and entity types only)
    if token_map or entities:
        audit_log({
            "original_length": len(user_prompt),
            "masked_tokens": list(token_map.keys()),
            "presidio_entities": [e.entity_type for e in entities],
            "timestamp": datetime.now(timezone.utc).isoformat()
        })
 
    # Step 3: LLM call with sanitized prompt
    response = client.messages.create(
        model="claude-haiku-4-5-20251001",
        max_tokens=1024,
        messages=[{"role": "user", "content": safe_prompt}]
    )
 
    raw_response = response.content[0].text
 
    # Step 4: unmask if needed (token references in LLM response)
    final_response = unmask_response(raw_response, token_map)
 
    return final_response
 
def audit_log(entry: dict):
    # Write to your audit system — S3, CloudWatch, or a database
    # NEVER log the original values, only the token keys
    print(f"[AUDIT] PII masked: {entry}")
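The unmask step relies on exact string replacement, so it silently skips any token the model altered or invented. A small check for token-shaped strings that are not in the map can surface this. The sketch below assumes the `[TYPE_XXXXXXXX]` token format produced by `mask_with_tokens`; `unmask_checked` and `TOKEN_RE` are hypothetical names.

```python
import re

# Matches tokens shaped like [PAN_A1B2C3D4]: a type name plus 8 uppercase hex chars.
TOKEN_RE = re.compile(r'\[[A-Z0-9_]+_[0-9A-F]{8}\]')


def unmask_checked(response: str, token_map: dict) -> tuple[str, list[str]]:
    """Unmask known tokens and report token-shaped strings not in token_map."""
    unknown = sorted({t for t in TOKEN_RE.findall(response) if t not in token_map})
    for token, original in token_map.items():
        response = response.replace(token, original)
    return response, unknown
```

A non-empty `unknown` list is worth counting in metrics or the audit log: it usually means the model rewrote a token, and the user will see a placeholder instead of the real value.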

Testing PII Detection Coverage

Write tests for your patterns before going to production:

python
import pytest

# Assumes mask_with_tokens is importable from the module that defines it.
TEST_CASES = [
    ("Call me at +91-9876543210", True, "PHONE_IN"),
    ("My PAN is ABCDE1234F", True, "PAN"),
    ("Email: user@example.com", True, "EMAIL"),
    ("AWS key: AKIAIOSFODNN7EXAMPLE", True, "AWS_KEY"),
    ("No PII here, just DevOps talk", False, None),
]

# One test per case, so a failure names the exact input that broke
@pytest.mark.parametrize("text, should_detect, expected_type", TEST_CASES)
def test_pii_detection(text, should_detect, expected_type):
    masked, token_map = mask_with_tokens(text)
    if should_detect:
        assert token_map, f"Should have detected PII in: {text}"
        assert any(k.startswith(f"[{expected_type}_") for k in token_map), \
            f"Expected {expected_type} in {list(token_map)}"
    else:
        assert not token_map, f"False positive in: {text}"

Key Rules for Production

  • Never log the original PII values — only log that masking occurred and which types
  • Keep the token map in memory only for the duration of a single request; never persist it
  • Test your patterns against real data samples before deploying — edge cases exist
  • For GDPR/DPDPA compliance, document your masking pipeline in your data processing records
  • Presidio + regex gives you defense-in-depth; neither alone is sufficient
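The first rule, logging types rather than values, can be sketched as a helper that derives per-type counts from the token keys alone. This is an illustration under the assumption that tokens follow the `[TYPE_XXXXXXXX]` format used earlier; `build_audit_record` and the `request_id` field are hypothetical.

```python
import json
import re
from collections import Counter
from datetime import datetime, timezone

# Extracts the type name from a token key such as [PHONE_IN_E5F6A7B8].
_TOKEN_TYPE_RE = re.compile(r'^\[([A-Z0-9_]+)_[0-9A-F]{8}\]$')


def build_audit_record(request_id: str, token_map: dict) -> str:
    """Summarise what was masked using token keys only, never the values."""
    counts = Counter()
    for token in token_map:  # iterates keys; original values are not read
        m = _TOKEN_TYPE_RE.match(token)
        counts[m.group(1) if m else "UNKNOWN"] += 1
    record = {
        "request_id": request_id,
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "masked_counts": dict(counts),
        "total_masked": sum(counts.values()),
    }
    return json.dumps(record, sort_keys=True)
```

The resulting JSON line can go to whatever audit sink you already use; since it never touches the map's values, it is safe to retain after the token map itself is discarded.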

The cost of retrofitting PII protection after a data leak is orders of magnitude higher than building it in from day one.
