Data Masking & Anonymization Code Review Guide
1. Introduction to Data Masking & Anonymization
Data masking and anonymization are techniques for transforming sensitive data so it can be used safely — in non-production environments, analytics pipelines, logs, API responses, and third-party integrations — without exposing the original values. Unlike encryption (which is reversible with a key), many masking and anonymization techniques are irreversible by design: once data is masked, the original cannot be recovered.
Why Masking Is Not Optional
GDPR Article 25 mandates "data protection by design and by default." CCPA grants consumers a right to deletion, and properly de-identified data falls outside its definition of personal information. HIPAA allows protected health information (PHI) to be used for research without individual authorization once it has been de-identified. PCI DSS prohibits storing sensitive authentication data (such as the CVV) after authorization and requires stored card numbers to be rendered unreadable; tokenization is a standard way to meet that requirement. Leaving data unmasked in non-production environments, logs, or analytics pipelines conflicts with most of these frameworks and is a common finding in security audits.
In this guide, you'll learn the difference between masking, tokenization, pseudonymization, and anonymization — and when to use each. You'll see how to implement field-level masking in code, review common mistakes in data pipelines, understand database-level dynamic masking, and apply formal anonymization guarantees like k-anonymity and differential privacy.
Data Masking Techniques at a Glance
From Raw PII to Safe Output
What is the key difference between data masking and encryption?
2. Real-World Scenario
The Scenario: You're reviewing a fintech application that processes loan applications. The development team needs production-like data for testing, the analytics team needs customer data for reporting, and a third-party ML vendor needs training data. All three requests involve PII.
❌ Vulnerable: Common Data Masking Failures
// --- Test Data Seeding Script ---
async function seedTestDatabase() {
  // ❌ Copies production data directly to test environment
  const prodUsers = await prodDb.query('SELECT * FROM users');
  for (const user of prodUsers) {
    await testDb.query(
      'INSERT INTO users VALUES ($1, $2, $3, $4, $5)',
      [user.id, user.name, user.email, user.ssn, user.credit_score]
    );
  }
  console.log(`Seeded ${prodUsers.length} real users to test DB`);
}

// --- Analytics Export Endpoint ---
app.get('/api/analytics/export', adminOnly, async (req, res) => {
  const data = await db.query(`
    SELECT name, email, date_of_birth, zip_code, loan_amount,
           income, credit_score, employment_status
    FROM loan_applications
    WHERE created_at > $1
  `, [req.query.since]);

  // ❌ Full PII sent to analytics: "we'll mask it in the BI tool"
  res.csv(data);
});

// --- ML Training Data Export ---
app.get('/api/ml/training-data', apiKeyAuth, async (req, res) => {
  const applications = await db.query(
    'SELECT * FROM loan_applications LIMIT 100000'
  );

  // ❌ "Masking" by removing the name field but keeping everything else
  const masked = applications.map(application => {
    const { name, ...rest } = application;
    return rest;
  });

  // ❌ email + DOB + zip code = re-identification
  res.json(masked);
});

// --- Log Masking Attempt ---
function maskSensitiveData(data: any) {
  // ❌ Only masks exact field names and misses variations such as
  //    socialSecurityNumber, SSN, social_security,
  //    pass, passwd, pwd, secret, creditCard
  if (data.ssn) data.ssn = '***-**-****';
  if (data.password) data.password = '[REDACTED]';
  return data;
}
Five Critical Failures
This code contains: 1) Production data cloned to test environments without masking, so a single test DB breach exposes all real customers. 2) PII exported raw for analytics with the promise of "masking later", although the data is exposed the moment it leaves the secure boundary. 3) Naive masking that removes names but keeps a direct identifier (email) alongside quasi-identifiers (DOB + zip), which together allow re-identification. 4) Incomplete field matching: only exact field names are caught, so common variations slip through. 5) No formal anonymization guarantees: no k-anonymity, no differential privacy, and no verification that the output cannot be re-identified.
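The fourth failure (exact-name matching) can be narrowed by normalizing keys before comparing them and by walking nested objects. The following is a minimal sketch, not a complete solution: `redactForLog`, `normalizeKey`, and the keyword list are hypothetical names chosen for illustration, and the list would need to be extended for a real schema.

```typescript
// Hypothetical keyword list (an assumption for illustration); extend per schema.
const SENSITIVE_KEYWORDS = [
  "ssn", "socialsecurity", "password", "passwd", "pwd",
  "secret", "creditcard", "cardnumber", "token", "apikey",
];

// Collapse "socialSecurityNumber", "social_security" and "SSN" into one comparable form.
function normalizeKey(key: string): string {
  return key.toLowerCase().replace(/[^a-z0-9]/g, "");
}

function isSensitiveKey(key: string): boolean {
  const normalized = normalizeKey(key);
  return SENSITIVE_KEYWORDS.some((keyword) => normalized.includes(keyword));
}

// Returns a redacted deep copy; the input object is never mutated.
function redactForLog(value: unknown): unknown {
  if (Array.isArray(value)) return value.map((item) => redactForLog(item));
  if (value instanceof Date) return value;
  if (value !== null && typeof value === "object") {
    const out: Record<string, unknown> = {};
    for (const [key, child] of Object.entries(value as Record<string, unknown>)) {
      out[key] = isSensitiveKey(key) ? "[REDACTED]" : redactForLog(child);
    }
    return out;
  }
  return value;
}
```

Substring matching deliberately over-redacts (a key such as `className` contains "ssn" once normalized). For log output that is usually the safer failure mode, but it is still a deny-list and does not catch PII embedded in free-text values.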
An analytics export removes the 'name' field but includes email, date_of_birth, and zip_code. Is this data properly anonymized?
3. Data Masking Techniques
Data masking replaces sensitive values with realistic but fake alternatives. There are two primary modes: static masking (applied once to create a sanitized copy of the data) and dynamic masking (applied at query time based on the requesting user's role).
Masking Techniques Comparison
| Technique | How It Works | Reversible? | Best For |
|---|---|---|---|
| Substitution | Replace real values with realistic fake data from a lookup table | No | Names, addresses, emails in test environments |
| Shuffling | Randomly swap values between records within the same column | No (without the shuffle key) | Preserving statistical distribution while breaking identity links |
| Character Masking | Replace characters with a fixed symbol (e.g., * or X) | No | Display masking: SSN, credit cards, phone numbers in UI |
| Numeric Variance | Add random noise within a defined range (±5%, ±10) | No | Salaries, ages, financial figures for analytics |
| Date Shifting | Shift dates by a consistent random offset per record | No (without the offset) | Preserving time intervals while hiding actual dates |
| Nulling Out | Replace sensitive values with NULL or empty string | No | Fields not needed in the target environment at all |
| Format-Preserving Masking | Replace value with a fake that matches the same format/regex | Depends on method | Credit cards (must pass Luhn check), phone numbers, IDs |
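Three of the techniques in the table (shuffling, numeric variance, and date shifting) are short enough to sketch directly. The function names below are hypothetical, and the sketch assumes Node.js for `crypto.randomInt` and ISO `YYYY-MM-DD` date strings.

```typescript
import { randomInt } from "node:crypto";

// Shuffling: Fisher-Yates over a copy of one column's values.
function shuffleColumn<T>(values: readonly T[]): T[] {
  const out = [...values];
  for (let i = out.length - 1; i > 0; i--) {
    const j = randomInt(0, i + 1); // uniform integer in [0, i]
    const tmp = out[i] as T;
    out[i] = out[j] as T;
    out[j] = tmp;
  }
  return out;
}

// Numeric variance: scale by a random factor in [1 - maxFraction, 1 + maxFraction].
function addVariance(value: number, maxFraction: number): number {
  const unit = randomInt(0, 2000001) / 1000000 - 1; // uniform in [-1, 1]
  return value * (1 + unit * maxFraction);
}

// Date shifting: move an ISO date (YYYY-MM-DD) by a whole number of days.
function shiftDate(isoDate: string, offsetDays: number): string {
  const d = new Date(`${isoDate}T00:00:00Z`);
  d.setUTCDate(d.getUTCDate() + offsetDays);
  return d.toISOString().slice(0, 10);
}

// Draw one offset per person so the gaps between that person's dates survive.
function randomOffsetDays(maxDays: number): number {
  return randomInt(0, 2 * maxDays + 1) - maxDays; // uniform in [-maxDays, maxDays]
}
```

For date shifting, the same offset has to be reused for every date belonging to one individual (for example, an application date and a decision date); otherwise the intervals the technique is meant to preserve are lost. Shuffling can leave some values in their original row, so it suits large columns better than small ones.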
❌ Vulnerable: Naive Masking Implementations
import crypto from 'crypto';

// ❌ Pattern 1: Incomplete character masking
function maskEmail(email: string): string {
  // Only masks the local part; the domain reveals the company
  return email.replace(/^.+@/, '****@');
  // "john.doe@secretstartup.com" → "****@secretstartup.com"
  // ❌ The domain is still identifying!
}

// ❌ Pattern 2: Deterministic masking without salt
function maskSSN(ssn: string): string {
  // Same input always produces the same output
  const hash = crypto.createHash('md5').update(ssn).digest('hex');
  return hash.slice(0, 3) + '-' + hash.slice(3, 5) + '-' + hash.slice(5, 9);
  // ❌ Attacker can pre-compute all 1 billion SSNs and reverse the mapping
}

// ❌ Pattern 3: Preserving too much structure
function maskPhoneNumber(phone: string): string {
  // Keeps area code, which reveals geographic location
  return phone.slice(0, 3) + '-***-****';
  // "415-555-1234" → "415-***-****"
  // ❌ Area code narrows down to San Francisco
}

// ❌ Pattern 4: Regex-only masking for logs
function maskLogEntry(log: string): string {
  // Only catches one SSN format and misses "555 12 3456" or "555123456"
  return log.replace(/\d{3}-\d{2}-\d{4}/g, '***-**-****');
}
✅ Secure: Proper Masking Implementations
import crypto from 'crypto';

// ✅ Secure email masking: masks both local and domain parts
function maskEmail(email: string): string {
  const [local, domain] = email.split('@');
  if (!local || !domain) return '***@***.***';
  const maskedLocal = local[0] + '***';
  const domainParts = domain.split('.');
  const maskedDomain = (domainParts[0][0] ?? '*') + '***.' + domainParts.slice(1).join('.');
  return maskedLocal + '@' + maskedDomain;
  // "john.doe@secretstartup.com" → "j***@s***.com"
}

// ✅ Keyed, format-preserving masking: HMAC output mapped onto nine digits
function maskSSN(ssn: string, maskingKey: string): string {
  const mac = crypto.createHmac('sha256', maskingKey).update(ssn).digest();
  const digits = String(mac.readUIntBE(0, 6) % 1000000000).padStart(9, '0');
  return digits.slice(0, 3) + '-' + digits.slice(3, 5) + '-' + digits.slice(5);
  // ✅ Without the maskingKey, pre-computation attacks are infeasible.
  // Anyone holding the key can rebuild the mapping, so treat it as a secret.
}

// ✅ Phone masking: removes area code, keeps last 4 for usability
function maskPhoneNumber(phone: string): string {
  const digits = phone.replace(/\D/g, '');
  if (digits.length < 4) return '***-***-****';
  return '***-***-' + digits.slice(-4);
}

// ✅ Comprehensive PII masking for objects
interface MaskingConfig {
  key: string;
  // Reserved for per-field overrides; createMasker below does not consult it.
  rules: Record<string, (value: string, key: string) => string>;
}

function createMasker(config: MaskingConfig) {
  // `_?` lets each pattern match snake_case and camelCase spellings alike.
  const fieldPatterns = new Map<RegExp, (value: string, key: string) => string>([
    [/^(e_?mail(_?address)?)$/i, (v) => maskEmail(v)],
    [/^(ssn|social_?security(_?number)?)$/i, (v) => maskSSN(v, config.key)],
    [/^(phone(_?number)?|telephone|mobile|cell)$/i, (v) => maskPhoneNumber(v)],
    [/^((first|last|full)_?name|name)$/i, () => '[MASKED]'],
    [/^(password|passwd|pwd|secret|token|api_?key)$/i, () => '[REDACTED]'],
    [/^(card_?number|credit_?card|cc_?number)$/i, (v) => '****-****-****-' + v.slice(-4)],
    [/^(address|street(_?address)?)$/i, () => '[MASKED ADDRESS]'],
    [/^(date_?of_?birth|dob|birth_?date)$/i, (v) => v.slice(0, 4) + '-**-**'],
  ]);

  // Shallow: only top-level string fields are inspected.
  return function maskObject<T extends Record<string, unknown>>(obj: T): T {
    const masked = { ...obj };
    for (const [field, value] of Object.entries(masked)) {
      if (typeof value !== 'string') continue;
      for (const [pattern, maskFn] of fieldPatterns) {
        if (pattern.test(field)) {
          (masked as Record<string, unknown>)[field] = maskFn(value, field);
          break;
        }
      }
    }
    return masked;
  };
}

// Usage:
// const masker = createMasker({ key: process.env.MASKING_KEY!, rules: {} });
// const safeUser = masker(user);
A developer masks SSNs by hashing them with MD5 (no salt/key). Why is this insecure?
4. Tokenization
Tokenization replaces sensitive data with a non-sensitive surrogate (a "token") that maps back to the original through a secure token vault. Unlike encryption, tokens have no mathematical relationship to the original data — they cannot be reversed without access to the vault. Tokenization is the industry standard for PCI DSS compliance (credit card processing) and is increasingly used for PII protection.
Tokenization vs. Encryption vs. Masking
| Property | Tokenization | Encryption | Masking |
|---|---|---|---|
| Reversible? | Yes (with vault access) | Yes (with key) | Usually no |
| Original stored where? | Secure token vault (separate system) | In place (as ciphertext) | Nowhere (destroyed) |
| Output format | Configurable (same length, same format) | Different format/length | Same or different format |
| Performance | Vault lookup per detokenization | CPU-bound crypto operation | Simple string transformation |
| PCI DSS scope | Tokens can be out of scope when the vault is isolated | Ciphertext generally stays in scope | Irreversibly masked values are generally out of scope |
| Best for | Payment processing, PII that needs occasional retrieval | Data at rest that is frequently read | Test data, logs, analytics where original is never needed |
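The "Output format" row says a token can keep the length and format of the value it replaces. A minimal sketch of that idea for card numbers follows; `formatPreservingToken` is a hypothetical helper, and keeping the last four digits is an assumption made for display purposes rather than a requirement of tokenization.

```typescript
import { randomInt } from "node:crypto";

// Hypothetical helper: replaces every digit except the last four with random digits.
function formatPreservingToken(cardNumber: string): string {
  const digits = cardNumber.replace(/\D/g, "");
  if (digits.length < 8) throw new Error("card number too short to tokenize");
  let random = "";
  for (let i = 0; i < digits.length - 4; i++) {
    random += String(randomInt(0, 10)); // each digit is drawn independently of the input
  }
  return random + digits.slice(-4);
}
```

The random part has no mathematical relationship to the original, so the mapping still has to live in a vault. Random tokens can collide with each other, so the vault would need a uniqueness constraint and a retry on conflict, which this sketch leaves out.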
❌ Vulnerable: Broken Tokenization
import crypto from 'crypto';

// ❌ Pattern 1: "Tokenization" that's just encoding
function tokenize(creditCard: string): string {
  // ❌ Base64 is NOT tokenization: it is trivially reversible
  return Buffer.from(creditCard).toString('base64');
}

// ❌ Pattern 2: Token derived from original data
function tokenizeSSN(ssn: string): string {
  // ❌ Deterministic hash: same SSN always maps to same token
  // Enables frequency analysis and cross-dataset correlation
  return 'tok_' + crypto.createHash('sha256').update(ssn).digest('hex').slice(0, 16);
}

// ❌ Pattern 3: Token vault with no access controls
const tokenVault = new Map<string, string>(); // ❌ In-memory, no encryption
function storeToken(token: string, original: string) {
  tokenVault.set(token, original); // ❌ Anyone with memory access can dump the vault
}

// ❌ Pattern 4: Detokenization without audit logging
app.get('/api/detokenize/:token', adminOnly, async (req, res) => {
  const original = await vault.detokenize(req.params.token);
  // ❌ No audit log of who accessed what and why
  res.json({ value: original });
});
✅ Secure: Proper Tokenization Implementation
import crypto from 'crypto';

// Placeholder for this guide's data-access layer (an assumption, not a real
// library): query() resolves to an array of rows, as in the earlier examples.
interface Database {
  query(sql: string, params?: unknown[]): Promise<any[]>;
}

interface TokenVaultConfig {
  encryptionKey: Buffer; // 32 bytes for AES-256
  db: Database;
}

class TokenVault {
  private config: TokenVaultConfig;

  constructor(config: TokenVaultConfig) {
    this.config = config;
  }

  // ✅ Generate a cryptographically random token with no relation to original
  async tokenize(originalValue: string, dataType: string): Promise<string> {
    const token = 'tok_' + crypto.randomBytes(16).toString('hex');

    // ✅ Encrypt the original before storing in the vault
    const iv = crypto.randomBytes(12); // 96-bit IV, the recommended size for GCM
    const cipher = crypto.createCipheriv('aes-256-gcm', this.config.encryptionKey, iv);
    let encrypted = cipher.update(originalValue, 'utf8', 'hex');
    encrypted += cipher.final('hex');
    const authTag = cipher.getAuthTag().toString('hex');

    // ✅ Store encrypted original in isolated vault database
    await this.config.db.query(
      `INSERT INTO token_vault (token, encrypted_value, iv, auth_tag, data_type, created_at)
       VALUES ($1, $2, $3, $4, $5, NOW())`,
      [token, encrypted, iv.toString('hex'), authTag, dataType]
    );

    return token;
  }

  // ✅ Detokenize with mandatory audit logging.
  // Authorization is assumed to be enforced by the caller before this runs.
  async detokenize(
    token: string,
    requestContext: { userId: string; reason: string; ipAddress: string; ticketId?: string }
  ): Promise<string> {
    const [record] = await this.config.db.query(
      'SELECT encrypted_value, iv, auth_tag FROM token_vault WHERE token = $1',
      [token]
    );

    if (!record) throw new Error('Token not found');

    // ✅ Audit log every detokenization
    await this.config.db.query(
      `INSERT INTO detokenization_audit_log
         (token, user_id, reason, ticket_id, timestamp, ip_address)
       VALUES ($1, $2, $3, $4, NOW(), $5)`,
      [token, requestContext.userId, requestContext.reason,
       requestContext.ticketId ?? null, requestContext.ipAddress]
    );

    // ✅ Decrypt the original value
    const iv = Buffer.from(record.iv, 'hex');
    const decipher = crypto.createDecipheriv('aes-256-gcm', this.config.encryptionKey, iv);
    decipher.setAuthTag(Buffer.from(record.auth_tag, 'hex'));
    let decrypted = decipher.update(record.encrypted_value, 'hex', 'utf8');
    decrypted += decipher.final('utf8'); // throws if the auth tag does not verify

    return decrypted;
  }

  // ✅ Batch tokenization for data pipelines
  // The returned map is keyed by the original value, so it holds plaintext
  // and should be discarded as soon as the substitution is done.
  async tokenizeBatch(
    records: Array<{ field: string; value: string }>,
    dataType: string
  ): Promise<Map<string, string>> {
    const tokenMap = new Map<string, string>();
    for (const record of records) {
      const token = await this.tokenize(record.value, dataType);
      tokenMap.set(record.value, token);
    }
    return tokenMap;
  }
}
A payment system 'tokenizes' credit card numbers by Base64-encoding them. Why is this not real tokenization?
5. Anonymization & Pseudonymization
Pseudonymization replaces identifying fields with artificial identifiers — it's reversible with a mapping table, so the data is still considered personal data under GDPR. Anonymization is irreversible: no one, including the data holder, can re-identify the individuals. GDPR does not apply to truly anonymized data. The distinction has massive regulatory implications.
Pseudonymization vs. Anonymization
| Property | Pseudonymization | Anonymization |
|---|---|---|
| Reversible? | Yes (with mapping/key) | No — designed to be irreversible |
| GDPR applies? | Yes — still personal data | No — falls outside GDPR scope |
| Direct identifiers | Replaced with pseudonyms | Removed or destroyed |
| Quasi-identifiers | Often left intact | Generalized, suppressed, or perturbed |
| Use cases | Internal analytics, research with re-linking capability | Public datasets, third-party sharing, open data |
| Regulatory status | Named in GDPR Art. 25 and Art. 32 as an example of an appropriate safeguard | Exempt from GDPR (Recital 26) |
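Because pseudonymization is reversible or linkable through its key, how that key is scoped matters. One way to limit linking between exports is to derive pseudonyms with an HMAC under a key that is generated per export. This is a minimal sketch under that assumption; `pseudonymFor` and `newExportKey` are hypothetical names, and key storage is out of scope.

```typescript
import { createHmac, randomBytes } from "node:crypto";

// Keyed pseudonym: stable within one export, unlinkable across exports
// as long as each export uses its own key.
function pseudonymFor(userId: string, exportKey: Buffer): string {
  return createHmac("sha256", exportKey).update(userId).digest("hex").slice(0, 32);
}

// A fresh random key per export (where it is stored is an assumption left open).
function newExportKey(): Buffer {
  return randomBytes(32);
}
```

Within a single export the same user still maps to the same pseudonym, which keeps joins working. The output remains pseudonymized personal data for as long as the key exists, and destroying the key does not by itself make the data anonymous if quasi-identifiers are left intact.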
❌ Vulnerable: Fake "Anonymization"
import crypto from 'crypto';

// ❌ Pattern 1: Pseudonymization labeled as anonymization
function anonymize(user: User) {
  return {
    id: crypto.randomUUID(),          // New random ID
    name: '[ANONYMIZED]',             // Removed
    email: '[ANONYMIZED]',            // Removed
    // ❌ These quasi-identifiers enable re-identification:
    dateOfBirth: user.dateOfBirth,    // ❌ Exact DOB kept
    zipCode: user.zipCode,            // ❌ Full zip kept
    gender: user.gender,              // ❌ Gender kept
    // DOB + ZIP + gender uniquely identify roughly 87% of the U.S. population
    joinDate: user.joinDate,          // ❌ Exact join date kept
    purchaseHistory: user.purchases,  // ❌ Behavioral data is identifying
  };
}

// ❌ Pattern 2: Consistent pseudonyms across datasets
function pseudonymizeForExport(userId: string): string {
  // Same user always gets the same pseudonym across ALL exports
  return crypto.createHash('sha256').update(userId + 'static_salt').digest('hex');
  // ❌ Enables cross-dataset linking: an attacker holding one dataset
  //    with known users can link it to the "anonymous" dataset
}

// ❌ Pattern 3: Insufficient suppression threshold
function anonymizeSmallGroup(records: any[]) {
  // Only 2 people aged 95+ in zip code 02139
  // ❌ No k-anonymity check: groups of 1-2 are trivially identifiable
  return records.map(r => ({
    ageRange: Math.floor(r.age / 5) * 5 + '-' + (Math.floor(r.age / 5) * 5 + 4),
    zipCode: r.zipCode, // ❌ Full zip kept
    condition: r.medicalCondition,
  }));
}
✅ Secure: Proper Anonymization with k-Anonymity
interface AnonymizationConfig {
  kThreshold: number;         // Minimum group size (typically k >= 5)
  quasiIdentifiers: string[]; // Fields that could enable re-identification
  sensitiveFields: string[];  // Fields to protect (the "payload"); not read by this function
  suppressionLimit: number;   // Max fraction of records to suppress (e.g. 0.05 = 5%)
}

type AnonRecord = Record<string, unknown>;

function anonymizeDataset(records: AnonRecord[], config: AnonymizationConfig): AnonRecord[] {
  // Key that identifies a record's equivalence class
  const quasiKey = (record: AnonRecord) =>
    config.quasiIdentifiers.map(qi => String(record[qi] ?? '')).join('|');

  const generalized = records.map(record => {
    const result: AnonRecord = { ...record };

    // ✅ Remove all direct identifiers
    for (const field of ['name', 'email', 'ssn', 'phone', 'address']) {
      delete result[field];
    }

    // ✅ Generalize quasi-identifiers
    if (typeof result.dateOfBirth === 'string') {
      // UTC avoids a time-zone shift moving January 1st into the previous year
      const year = new Date(result.dateOfBirth).getUTCFullYear();
      result.dateOfBirth = `${Math.floor(year / 10) * 10}s`;
    }

    if (typeof result.zipCode === 'string') {
      result.zipCode = result.zipCode.slice(0, 3) + '**';
    }

    if (typeof result.age === 'number') {
      const bucket = Math.floor(result.age / 10) * 10;
      result.age = `${bucket}-${bucket + 9}`;
    }

    return result;
  });

  // ✅ Verify k-anonymity: every combination of quasi-identifiers
  //    must appear in at least k records
  const groupSizes = new Map<string, number>();
  for (const record of generalized) {
    const key = quasiKey(record);
    groupSizes.set(key, (groupSizes.get(key) ?? 0) + 1);
  }

  // ✅ Count the records that sit in groups smaller than k
  const suppressedCount = [...groupSizes.values()]
    .filter(size => size < config.kThreshold)
    .reduce((sum, size) => sum + size, 0);

  if (records.length > 0 && suppressedCount / records.length > config.suppressionLimit) {
    throw new Error(
      'Suppression would remove ' +
      (suppressedCount / records.length * 100).toFixed(1) +
      '% of records (limit: ' +
      (config.suppressionLimit * 100) +
      '%). Increase generalization before exporting.'
    );
  }

  // ✅ Suppress (drop) every record whose group is smaller than k
  return generalized.filter(
    record => (groupSizes.get(quasiKey(record)) ?? 0) >= config.kThreshold
  );
}

// Usage:
// const safeData = anonymizeDataset(patients, {
//   kThreshold: 5,
//   quasiIdentifiers: ['dateOfBirth', 'zipCode', 'gender'],
//   sensitiveFields: ['diagnosis', 'treatment'],
//   suppressionLimit: 0.05,
// });
A dataset has been 'anonymized' by replacing names with random IDs but keeping exact date of birth, full zip code, and gender. Under GDPR, is this anonymization or pseudonymization?
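Differential privacy, the other formal guarantee named in the introduction, protects aggregate answers rather than individual rows. A common construction is the Laplace mechanism: add noise with scale sensitivity/ε to a numeric query result. The sketch below applies it to a count, whose sensitivity is 1 because adding or removing one person changes the count by at most 1. The function names are hypothetical and the `rand` parameter exists only to make the example testable.

```typescript
// Sample from Laplace(0, scale) using the inverse CDF.
function sampleLaplace(scale: number, rand: () => number = Math.random): number {
  let u = rand() - 0.5; // uniform in [-0.5, 0.5)
  while (u === -0.5) u = rand() - 0.5; // avoid log(0)
  return -scale * Math.sign(u) * Math.log(1 - 2 * Math.abs(u));
}

// Epsilon-differentially-private count: true count plus Laplace(1 / epsilon) noise.
function privateCount(
  trueCount: number,
  epsilon: number,
  rand: () => number = Math.random
): number {
  if (epsilon <= 0) throw new Error("epsilon must be positive");
  const sensitivity = 1; // one individual changes a count by at most 1
  return trueCount + sampleLaplace(sensitivity / epsilon, rand);
}

// Usage:
// const noisy = privateCount(applicantsInZip.length, 0.5);
```

Smaller ε means more noise and stronger privacy, and every released answer consumes part of a total privacy budget. This is an illustration only: `Math.random` is not a cryptographic source and naive floating-point Laplace sampling has known weaknesses, so a production system would normally rely on a vetted differential privacy library.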
6. Implementation Patterns
Masking must be applied consistently at every boundary where data leaves the secure perimeter: API responses, log output, error messages, analytics pipelines, data exports, and non-production environments. A centralized masking layer prevents the common failure mode of masking in some places but not others.
✅ Centralized Masking Middleware for APIs
// Assumes Express with an auth layer that sets req.user; getUserClearance is an
// application helper, and maskEmail / maskPhoneNumber come from Section 3.
import type { Request, Response, NextFunction } from 'express';

type SensitivityLevel = 'public' | 'internal' | 'confidential' | 'restricted';

interface FieldPolicy {
  sensitivity: SensitivityLevel;
  maskFn: (value: unknown) => unknown;
  neverExpose?: boolean; // true: omitted for every requester, whatever the clearance
}

// ✅ Define masking policies per field at the schema level
const USER_FIELD_POLICIES: Record<string, FieldPolicy> = {
  id:           { sensitivity: 'internal', maskFn: v => v },
  displayName:  { sensitivity: 'public', maskFn: v => v },
  email:        { sensitivity: 'confidential', maskFn: v => maskEmail(v as string) },
  phone:        { sensitivity: 'confidential', maskFn: v => maskPhoneNumber(v as string) },
  ssn:          { sensitivity: 'restricted', maskFn: () => '[RESTRICTED]' },
  passwordHash: { sensitivity: 'restricted', neverExpose: true, maskFn: () => undefined },
  dateOfBirth:  { sensitivity: 'confidential', maskFn: v => (v as string).slice(0, 4) + '-**-**' },
  salary:       { sensitivity: 'restricted', maskFn: () => '[RESTRICTED]' },
  role:         { sensitivity: 'internal', maskFn: v => v },
};

// ✅ Apply masking based on requester's clearance level
function applyMasking(
  data: Record<string, unknown>,
  policies: Record<string, FieldPolicy>,
  requesterClearance: SensitivityLevel
): Record<string, unknown> {
  const clearanceLevels: Record<SensitivityLevel, number> = {
    public: 0, internal: 1, confidential: 2, restricted: 3,
  };

  const result: Record<string, unknown> = {};

  for (const [field, value] of Object.entries(data)) {
    const policy = policies[field];
    // ✅ Unlisted fields are excluded by default; neverExpose fields are
    //    dropped even for requesters with 'restricted' clearance
    if (!policy || policy.neverExpose) continue;

    if (clearanceLevels[requesterClearance] >= clearanceLevels[policy.sensitivity]) {
      result[field] = value; // Requester has clearance: return original
    } else {
      const masked = policy.maskFn(value);
      if (masked !== undefined) result[field] = masked; // Apply masking
    }
  }

  return result;
}

// ✅ API middleware that applies masking automatically
// Only top-level objects and arrays of objects are masked; nested objects
// are handled as single fields by their policy.
function maskingMiddleware(policies: Record<string, FieldPolicy>) {
  return (req: Request, res: Response, next: NextFunction) => {
    const originalJson = res.json.bind(res);
    const clearance = getUserClearance(req.user);

    res.json = (body: unknown) => {
      if (Array.isArray(body)) {
        return originalJson(
          body.map(item =>
            typeof item === 'object' && item !== null
              ? applyMasking(item as Record<string, unknown>, policies, clearance)
              : item
          )
        );
      }
      if (typeof body === 'object' && body !== null) {
        return originalJson(
          applyMasking(body as Record<string, unknown>, policies, clearance)
        );
      }
      return originalJson(body);
    };
    next();
  };
}
✅ Static Masking for Test Environments
import { faker } from '@faker-js/faker';

// `Database` is this guide's placeholder data-access type: query() resolves
// to an array of rows and insert() writes one row.

interface StaticMaskingRule {
  column: string;
  generator: () => string | number;
}

// ✅ Generate realistic but fake data for each field
const maskingRules: Record<string, StaticMaskingRule[]> = {
  users: [
    { column: 'first_name', generator: () => faker.person.firstName() },
    { column: 'last_name', generator: () => faker.person.lastName() },
    { column: 'email', generator: () => faker.internet.email() },
    { column: 'phone', generator: () => faker.phone.number() },
    // replaceSymbols swaps each '#' for a random digit
    { column: 'ssn', generator: () => faker.helpers.replaceSymbols('###-##-####') },
    { column: 'date_of_birth', generator: () =>
        faker.date.between({ from: '1950-01-01', to: '2005-12-31' }).toISOString().slice(0, 10) },
    { column: 'address', generator: () => faker.location.streetAddress() },
    { column: 'password_hash', generator: () => '$2b$12$invalidhashfortesting' },
  ],
  loan_applications: [
    { column: 'applicant_name', generator: () => faker.person.fullName() },
    { column: 'income', generator: () =>
        Math.round(faker.number.int({ min: 30000, max: 200000 }) / 1000) * 1000 },
    { column: 'credit_score', generator: () =>
        faker.number.int({ min: 300, max: 850 }) },
  ],
};

// ✅ Apply static masking to a full database export
// Only tables listed in `rules` are copied. Within those tables, columns
// without a rule are copied unchanged, so the rule set must cover every
// sensitive column.
async function createMaskedTestDatabase(
  sourceDb: Database,
  targetDb: Database,
  rules: Record<string, StaticMaskingRule[]>
) {
  for (const [table, tableRules] of Object.entries(rules)) {
    // Table names come from the trusted `rules` config, never from user input
    const rows = await sourceDb.query(`SELECT * FROM ${table}`);
    const columnMap = new Map(tableRules.map(r => [r.column, r.generator]));

    for (const row of rows) {
      const maskedRow = { ...row };
      for (const [col, generator] of columnMap) {
        if (col in maskedRow) maskedRow[col] = generator();
      }
      await targetDb.insert(table, maskedRow);
    }

    console.log(`Masked ${rows.length} rows in ${table}`);
  }
}
A masking middleware only masks fields that are explicitly listed in its policy. Unlisted fields pass through unchanged. What is the security risk?
7. Code Review Defenses
Data Masking Code Review Principles
1) Mask at the boundary: Apply masking where data exits the secure perimeter (API responses, exports, logs, test data). 2) Use allow-lists, not deny-lists: Only include explicitly declared fields in output — don't try to enumerate every sensitive field to block. 3) Classify at the schema level: Annotate fields with sensitivity levels in the data model, not in individual endpoints. 4) Test masking: Write automated tests that verify sensitive fields are properly masked in API responses. 5) Separate environments: Never use production data in non-production environments without static masking. 6) Audit detokenization: Every time original data is recovered from a token, log who, when, and why.
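Principle 4 (test masking) is easier to apply broadly when tests scan whole payloads for PII-shaped values instead of checking fields one by one. A minimal sketch of such a scanner follows; `findPiiLeaks` and the two patterns are hypothetical and intentionally narrow, covering only U.S. SSN shapes and email addresses.

```typescript
interface PiiPattern {
  label: string;
  pattern: RegExp;
}

// Illustrative value patterns (an assumption); real suites would add more.
const PII_VALUE_PATTERNS: PiiPattern[] = [
  { label: "ssn", pattern: /\b\d{3}[- ]?\d{2}[- ]?\d{4}\b/ },
  { label: "email", pattern: /[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}/ },
];

// Walks any JSON-like value and returns "path: label" for each suspicious leaf.
function findPiiLeaks(value: unknown, path: string = "$"): string[] {
  const leaks: string[] = [];
  if (typeof value === "string" || typeof value === "number") {
    const text = String(value);
    for (const { label, pattern } of PII_VALUE_PATTERNS) {
      if (pattern.test(text)) leaks.push(`${path}: ${label}`);
    }
  } else if (Array.isArray(value)) {
    value.forEach((item, index) => {
      leaks.push(...findPiiLeaks(item, `${path}[${index}]`));
    });
  } else if (value !== null && typeof value === "object") {
    for (const [key, child] of Object.entries(value as Record<string, unknown>)) {
      leaks.push(...findPiiLeaks(child, `${path}.${key}`));
    }
  }
  return leaks;
}

// Usage in a test:
// expect(findPiiLeaks(response.body)).toEqual([]);
```

The SSN pattern also flags any unrelated nine-digit number, so a failing assertion is a prompt to inspect the field rather than proof of a leak. The scanner complements the allow-list in principle 2 and does not replace it.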
✅ Automated Masking Verification Tests
// Assumes Jest and supertest; `app`, `logger`, and `processUserRequest`
// come from the application under test.
import request from 'supertest';

describe('API response masking', () => {
  it('should not expose restricted fields to regular users', async () => {
    const response = await request(app)
      .get('/api/users/123')
      .set('Authorization', 'Bearer regular_user_token');

    // ✅ Verify restricted fields are absent or masked
    expect(response.body).not.toHaveProperty('passwordHash');
    expect(response.body).not.toHaveProperty('password_hash');
    expect(response.body).not.toHaveProperty('ssn');
    expect(response.body).not.toHaveProperty('salary');
    expect(response.body).not.toHaveProperty('internalNotes');

    // ✅ Verify confidential fields are masked
    if (response.body.email) {
      expect(response.body.email).toMatch(/^.\*\*\*@.\*\*\*\..+$/);
    }
    if (response.body.phone) {
      expect(response.body.phone).toMatch(/^\*\*\*-\*\*\*-\d{4}$/);
    }
  });

  it('should mask PII in log output', () => {
    const logSpy = jest.spyOn(logger, 'info');

    // Trigger a log event
    processUserRequest({
      email: 'john@example.com',
      ssn: '123-45-6789',
      password: 'secret123',
    });

    // Second argument of the first logger.info call (the structured payload)
    const loggedData = logSpy.mock.calls[0][1];

    // ✅ Verify no raw PII in logs
    expect(JSON.stringify(loggedData)).not.toContain('john@example.com');
    expect(JSON.stringify(loggedData)).not.toContain('123-45-6789');
    expect(JSON.stringify(loggedData)).not.toContain('secret123');
  });

  it('should enforce k-anonymity on analytics exports', async () => {
    const response = await request(app)
      .get('/api/analytics/export?since=2025-01-01')
      .set('Authorization', 'Bearer analyst_token');

    const records = response.body.data;

    // ✅ Verify no direct identifiers
    for (const record of records) {
      expect(record).not.toHaveProperty('name');
      expect(record).not.toHaveProperty('email');
      expect(record).not.toHaveProperty('ssn');
      expect(record).not.toHaveProperty('phone');
    }

    // ✅ Verify quasi-identifiers are generalized
    for (const record of records) {
      if (record.zipCode) {
        expect(record.zipCode).toMatch(/^\d{3}\*\*$/);
      }
      if (record.dateOfBirth) {
        expect(record.dateOfBirth).toMatch(/^\d{4}s$/);
      }
    }

    // ✅ Verify k-anonymity (k >= 5)
    const groups = new Map<string, number>();
    for (const record of records) {
      const key = [record.dateOfBirth, record.zipCode, record.gender].join('|');
      groups.set(key, (groups.get(key) || 0) + 1);
    }
    for (const [, count] of groups) {
      expect(count).toBeGreaterThanOrEqual(5);
    }
  });
});
✅ CI Pipeline Masking Verification
// Shapes inferred from how the function below uses them
interface ResponseSchema {
  fields: Array<{ name: string; masked: boolean }>;
}

interface AuditResult {
  endpoint: string;
  field: string;
  severity: 'critical';
  message: string;
}

// ✅ Pre-deploy check: scan API schemas for unmasked sensitive fields
function auditResponseSchemas(schemas: Record<string, ResponseSchema>): AuditResult[] {
  const sensitivePatterns = [
    /password/i, /passwd/i, /secret/i,
    /ssn/i, /social.?security/i,
    /credit.?card/i, /card.?number/i, /cvv/i,
    /api.?key/i, /private.?key/i, /token/i,
    /date.?of.?birth/i, /dob/i,
  ];

  const violations: AuditResult[] = [];

  for (const [endpoint, schema] of Object.entries(schemas)) {
    for (const field of schema.fields) {
      for (const pattern of sensitivePatterns) {
        if (pattern.test(field.name) && !field.masked) {
          violations.push({
            endpoint,
            field: field.name,
            severity: 'critical',
            message: `Potentially sensitive field '${field.name}' is not masked in ${endpoint}`,
          });
          break; // Report each field once, even if several patterns match
        }
      }
    }
  }

  return violations;
}
Which of these is the most effective way to ensure data masking is consistently applied across all API endpoints?