PII Detection and Masking in LLM Pipelines for Production
How to detect and mask PII before it reaches your LLM and leaks in responses. Covers Microsoft Presidio, regex detection for Indian data (Aadhaar, PAN), token-based masking, and audit logging.
Every DevOps team building LLM features eventually hits this problem: someone sends a support ticket with a user's email address, a log contains an API key, or a prompt includes a customer's phone number — and all of that gets sent to an external LLM provider and potentially logged on their servers.
PII leakage in LLM pipelines is a compliance problem (GDPR, DPDPA), a security risk, and a trust issue with your users. Here's how to handle it properly in production Python.
What PII Looks Like in DevOps Contexts
It's not just names and emails. In DevOps and platform engineering contexts, PII appears as:
- User emails in error logs forwarded to an LLM for analysis
- Phone numbers in support tickets fed to an AI classifier
- Aadhaar or PAN numbers in Indian fintech support data
- API keys or tokens in prompts asking "why did my request fail?"
- IP addresses in log summarization pipelines
- Employee names in incident reports analyzed by an AI
The goal: detect these before the LLM call, mask them with reversible tokens, unmask them in the response, and log what was masked for audit.
Option 1: Microsoft Presidio (Full NLP-Based Detection)
Presidio is an open-source library from Microsoft that uses spaCy NLP models to detect 50+ PII entity types. Best for general-purpose text.
pip install presidio-analyzer presidio-anonymizer
python -m spacy download en_core_web_lg

from presidio_analyzer import AnalyzerEngine
from presidio_anonymizer import AnonymizerEngine
from presidio_anonymizer.entities import OperatorConfig
analyzer = AnalyzerEngine()
anonymizer = AnonymizerEngine()
def detect_and_mask_presidio(text: str) -> tuple[str, list]:
    # Detect only the entity types we intend to mask
    results = analyzer.analyze(
        text=text,
        language="en",
        entities=["EMAIL_ADDRESS", "PHONE_NUMBER", "PERSON", "IP_ADDRESS", "CREDIT_CARD"],
    )
    if not results:
        return text, []
    # Replace each detected span with a fixed placeholder for its type
    anonymized = anonymizer.anonymize(
        text=text,
        analyzer_results=results,
        operators={
            "EMAIL_ADDRESS": OperatorConfig("replace", {"new_value": "<EMAIL>"}),
            "PHONE_NUMBER": OperatorConfig("replace", {"new_value": "<PHONE>"}),
            "PERSON": OperatorConfig("replace", {"new_value": "<PERSON>"}),
            "IP_ADDRESS": OperatorConfig("replace", {"new_value": "<IP>"}),
            "CREDIT_CARD": OperatorConfig("replace", {"new_value": "<CREDIT_CARD>"}),
        },
    )
    return anonymized.text, results
text = "User john.doe@example.com called from +91-9876543210 about order #1234"
masked, entities = detect_and_mask_presidio(text)
print(masked)
# "User <EMAIL> called from <PHONE> about order #1234"
Presidio is accurate but slower: expect 50–200ms per call depending on text length and model size. It is not suitable for high-throughput real-time paths without caching or batching.
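One way to take repeated inputs off the slow path is to memoize the masking call. The sketch below is a minimal illustration, assuming inputs repeat often enough for a cache to help (recurring log lines, for example). `make_cached_masker` and `slow_masker` are hypothetical names; `slow_masker` is a stand-in for a Presidio-backed function and is not part of Presidio.

```python
import hashlib
from collections import OrderedDict
from typing import Callable


def make_cached_masker(
    mask_fn: Callable[[str], str], max_entries: int = 10_000
) -> Callable[[str], str]:
    """Wrap a slow masking function with a small in-memory LRU cache.

    Keys are SHA-256 digests of the input, so the cache itself does not
    retain raw text; values are the already-masked outputs.
    """
    cache = OrderedDict()

    def cached(text: str) -> str:
        key = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if key in cache:
            cache.move_to_end(key)  # mark as most recently used
            return cache[key]
        masked = mask_fn(text)
        cache[key] = masked
        if len(cache) > max_entries:
            cache.popitem(last=False)  # evict the least recently used entry
        return masked

    return cached


# Hypothetical stand-in for a Presidio-backed masker; it counts its invocations.
calls = {"n": 0}


def slow_masker(text: str) -> str:
    calls["n"] += 1
    return text.replace("john.doe@example.com", "<EMAIL>")


mask = make_cached_masker(slow_masker, max_entries=2)
first = mask("Contact john.doe@example.com")
second = mask("Contact john.doe@example.com")  # served from the cache
```

Hashing the key is a design choice rather than a requirement: it keeps the cache from becoming a second place where unmasked text lives. The sketch caches only the masked string, so a function that also returns detection results would need to cache those alongside it.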
Option 2: Regex-Based Fast Detection for Known Patterns
For high-volume pipelines where you know the PII patterns, regex is 10–100x faster than NLP and works without downloading spaCy models.
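Pattern matching trades precision for speed: a 12-digit regex flags order IDs and other numeric identifiers as readily as real Aadhaar numbers. A checksum stage is one way to cut those false positives. The sketch below assumes Aadhaar numbers carry a Verhoeff check digit as their last digit, which is how the scheme is generally documented. `verhoeff_valid`, `find_aadhaar`, and `AADHAAR_CANDIDATE` are illustrative names, not part of any library.

```python
import re

# Verhoeff multiplication table (dihedral group D5)
_D = [
    [0, 1, 2, 3, 4, 5, 6, 7, 8, 9],
    [1, 2, 3, 4, 0, 6, 7, 8, 9, 5],
    [2, 3, 4, 0, 1, 7, 8, 9, 5, 6],
    [3, 4, 0, 1, 2, 8, 9, 5, 6, 7],
    [4, 0, 1, 2, 3, 9, 5, 6, 7, 8],
    [5, 9, 8, 7, 6, 0, 4, 3, 2, 1],
    [6, 5, 9, 8, 7, 1, 0, 4, 3, 2],
    [7, 6, 5, 9, 8, 2, 1, 0, 4, 3],
    [8, 7, 6, 5, 9, 3, 2, 1, 0, 4],
    [9, 8, 7, 6, 5, 4, 3, 2, 1, 0],
]
# Verhoeff position-dependent permutation table
_P = [
    [0, 1, 2, 3, 4, 5, 6, 7, 8, 9],
    [1, 5, 7, 6, 2, 8, 3, 0, 9, 4],
    [5, 8, 0, 3, 7, 9, 6, 1, 4, 2],
    [8, 9, 1, 6, 0, 4, 3, 5, 2, 7],
    [9, 4, 5, 3, 1, 2, 6, 8, 7, 0],
    [4, 2, 8, 6, 5, 7, 3, 9, 0, 1],
    [2, 7, 9, 3, 8, 0, 6, 4, 1, 5],
    [7, 0, 4, 6, 9, 1, 3, 2, 5, 8],
]

# Stage 1: cheap pattern match for 12 digits, optionally grouped in fours
AADHAAR_CANDIDATE = re.compile(r"\b[2-9]\d{3}\s?\d{4}\s?\d{4}\b")


def verhoeff_valid(digits: str) -> bool:
    """Return True if the digit string passes the Verhoeff checksum."""
    checksum = 0
    for position, char in enumerate(reversed(digits)):
        checksum = _D[checksum][_P[position % 8][int(char)]]
    return checksum == 0


def find_aadhaar(text: str) -> list[str]:
    """Stage 2: keep only pattern matches that also pass the checksum."""
    hits = []
    for match in AADHAAR_CANDIDATE.finditer(text):
        digits = re.sub(r"\s", "", match.group(0))
        if verhoeff_valid(digits):
            hits.append(match.group(0))
    return hits


# Both numbers are synthetic; only the first was constructed to pass the checksum.
hits = find_aadhaar("id 2345 6789 0124, order 2345 6789 0123")
```

For masking, over-matching is often the safer default, because a mistyped Aadhaar number fails the checksum and would slip through unmasked. A check like this is arguably better suited to labeling matches accurately in audit records, or to a second stage after a pattern match such as the AADHAAR regex in the pattern library that follows.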
import re
import uuid
# Pattern library
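PII_PATTERNS = {
    # Top-level domain is letters only, at least two characters
    "EMAIL": re.compile(r'\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b'),
    # Digit lookarounds keep the phone pattern from firing inside longer digit runs
    "PHONE_IN": re.compile(r'(?<!\d)(?:\+91[\-\s]?)?[6-9]\d{9}(?!\d)'),
    # 12 digits, optionally grouped in fours, first digit 2-9
    "AADHAAR": re.compile(r'\b[2-9]\d{3}\s?\d{4}\s?\d{4}\b'),
    # 5 letters, 4 digits, 1 letter
    "PAN": re.compile(r'\b[A-Z]{5}[0-9]{4}[A-Z]\b'),
    # Shape check only: octets above 255 also match, which over-masks rather than under-masks
    "IP_V4": re.compile(r'\b(?:\d{1,3}\.){3}\d{1,3}\b'),
    "AWS_KEY": re.compile(r'\b(AKIA|ASIA)[0-9A-Z]{16}\b'),
    "GENERIC_TOKEN": re.compile(r'\b(sk-|ghp_|glpat-|Bearer\s)[A-Za-z0-9_\-]{20,}\b'),
}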
def mask_with_tokens(text: str) -> tuple[str, dict]:
    """
    Returns masked text and a mapping of token -> original value
    for unmasking after LLM response.
    """
    token_map = {}
    masked = text
    for pii_type, pattern in PII_PATTERNS.items():
        # Bind pii_type as a default argument so each callback keeps its own label
        def replace_match(m, pii_type=pii_type):
            token = f"[{pii_type}_{uuid.uuid4().hex[:8].upper()}]"
            token_map[token] = m.group(0)
            return token
        masked = pattern.sub(replace_match, masked)
    return masked, token_map

def unmask_response(response: str, token_map: dict) -> str:
    """Replace tokens in LLM response with original values."""
    for token, original in token_map.items():
        response = response.replace(token, original)
    return response
Test it:
text = "User ABCDE1234F called from +919876543210 with key AKIAIOSFODNN7EXAMPLE"
masked, tokens = mask_with_tokens(text)
print(masked)
# "User [PAN_A1B2C3D4] called from [PHONE_IN_E5F6A7B8] with key [AWS_KEY_09C1D2E3]"
# (the 8-character suffixes are random hex, so actual values differ on every run)
print(tokens)
# {'[PHONE_IN_E5F6A7B8]': '+919876543210', '[PAN_A1B2C3D4]': 'ABCDE1234F', ...}
Combining Both Approaches
Use regex first (fast, catches known patterns), then Presidio for anything that slipped through:
def sanitize_for_llm(text: str) -> tuple[str, dict, list]:
    # Pass 1: fast regex masking with reversible tokens
    masked, token_map = mask_with_tokens(text)
    # Pass 2: NLP for PERSON names and other entities regex can't catch.
    # Presidio's fixed placeholders (<PERSON>, <EMAIL>, ...) are not in token_map,
    # so unmask_response does not restore them.
    nlp_masked, entities = detect_and_mask_presidio(masked)
    return nlp_masked, token_map, entities
Full Pipeline: Mask → LLM → Unmask
from datetime import datetime, timezone

import anthropic

client = anthropic.Anthropic()

def llm_call_with_pii_protection(user_prompt: str) -> str:
    # Step 1: sanitize
    safe_prompt, token_map, entities = sanitize_for_llm(user_prompt)
    # Step 2: audit log what was masked
    if token_map or entities:
        audit_log({
            "original_length": len(user_prompt),
            "masked_tokens": list(token_map.keys()),
            "presidio_entities": [e.entity_type for e in entities],
            "timestamp": datetime.now(timezone.utc).isoformat(),
        })
    # Step 3: LLM call with sanitized prompt
    response = client.messages.create(
        model="claude-haiku-4-5-20251001",
        max_tokens=1024,
        messages=[{"role": "user", "content": safe_prompt}],
    )
    raw_response = response.content[0].text
    # Step 4: unmask if needed (token references in LLM response)
    final_response = unmask_response(raw_response, token_map)
    return final_response

def audit_log(entry: dict):
    # Write to your audit system: S3, CloudWatch, or a database
    # NEVER log the original values, only the token keys
    print(f"[AUDIT] PII masked: {entry}")
Testing PII Detection Coverage
Write tests for your patterns before going to production:
import pytest

# Assumes mask_with_tokens from the regex section is defined or imported in this module
test_cases = [
    ("Call me at +91-9876543210", True, "PHONE_IN"),
    ("My PAN is ABCDE1234F", True, "PAN"),
    ("Email: user@example.com", True, "EMAIL"),
    ("AWS key: AKIAIOSFODNN7EXAMPLE", True, "AWS_KEY"),
    ("No PII here, just DevOps talk", False, None),
]

# One test per case, so a failure names the exact input that broke
@pytest.mark.parametrize("text, should_detect, expected_type", test_cases)
def test_pii_detection(text, should_detect, expected_type):
    masked, token_map = mask_with_tokens(text)
    if should_detect:
        assert token_map, f"Should have detected PII in: {text}"
        assert any(expected_type in k for k in token_map.keys()), \
            f"Expected {expected_type} in {token_map.keys()}"
    else:
        assert not token_map, f"False positive in: {text}"
Key Rules for Production
- Never log the original PII values — only log that masking occurred and which types
- Keep the token map in memory only for the duration of a single request; never persist it
- Test your patterns against real data samples before deploying — edge cases exist
- For GDPR/DPDPA compliance, document your masking pipeline in your data processing records
- Presidio + regex gives you defense-in-depth; neither alone is sufficient
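The first rule can be sketched as an audit entry built from token keys alone. This is a minimal illustration, assuming tokens follow the `[TYPE_XXXXXXXX]` shape produced by `mask_with_tokens`. `build_audit_entry`, `write_audit_entry`, the `request_id` field, and the `pii_audit` logger name are hypothetical choices, not an established schema.

```python
import json
import logging
import re
from collections import Counter
from datetime import datetime, timezone

logger = logging.getLogger("pii_audit")  # hypothetical logger name

# Tokens look like [PAN_1A2B3C4D]: a type label, an underscore, 8 hex characters
TOKEN_KEY = re.compile(r"^\[([A-Z0-9_]+)_[0-9A-F]{8}\]$")


def build_audit_entry(token_map: dict[str, str], request_id: str) -> dict:
    """Summarize what was masked without including any original value."""
    counts = Counter()
    for token in token_map:  # iterate keys only; the values hold the raw PII
        match = TOKEN_KEY.match(token)
        counts[match.group(1) if match else "UNKNOWN"] += 1
    return {
        "request_id": request_id,
        "masked_counts": dict(counts),
        "timestamp": datetime.now(timezone.utc).isoformat(),
    }


def write_audit_entry(entry: dict) -> None:
    """Emit the entry as one JSON line; handler configuration is left to the caller."""
    logger.info(json.dumps(entry, sort_keys=True))


# Example token map with synthetic values
example_map = {
    "[PAN_1A2B3C4D]": "ABCDE1234F",
    "[PHONE_IN_E5F6A7B8]": "+919876543210",
    "[PHONE_IN_00112233]": "+919876543211",
}
entry = build_audit_entry(example_map, request_id="req-1")
write_audit_entry(entry)
```

Logging counts per type instead of the token keys themselves keeps the record useful for compliance reporting while giving it nothing that could be joined back to a token map. Building the entry inside the request handler, and letting `token_map` go out of scope when the handler returns, is one way to follow the second rule as well.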
The cost of retrofitting PII protection after a data leak is orders of magnitude higher than building it in from day one.