Data Masking & Anonymization Code Review Guide

1. Introduction to Data Masking & Anonymization

Data masking and anonymization are techniques for transforming sensitive data so it can be used safely — in non-production environments, analytics pipelines, logs, API responses, and third-party integrations — without exposing the original values. Unlike encryption (which is reversible with a key), many masking and anonymization techniques are irreversible by design: once data is masked, the original cannot be recovered.

Why Masking Is Not Optional

GDPR Article 25 mandates "data protection by design and by default." CCPA grants consumers the right to deletion, which anonymization can satisfy. HIPAA allows protected health information (PHI) to be shared for research without patient authorization once it has been de-identified. PCI DSS requires stored primary account numbers (PANs) to be rendered unreadable and prohibits storing sensitive authentication data (such as the CVV) after authorization; tokenization is a common way to meet the first requirement and shrink audit scope. Leaving real data unmasked in non-production environments, logs, or analytics pipelines undermines each of these obligations and is a common finding in security audits.

In this guide, you'll learn the difference between masking, tokenization, pseudonymization, and anonymization — and when to use each. You'll see how to implement field-level masking in code, review common mistakes in data pipelines, understand database-level dynamic masking, and apply formal anonymization guarantees like k-anonymity and differential privacy.

Data Masking Techniques at a Glance

From Raw PII to Safe Output

Technique	Example input	Example output	Typical use
Static Masking	john.doe@acme.com	j***@a***.com	Non-production environments, test data
Dynamic Masking	555-12-3456	***-**-3456	Production queries, role-based views
Tokenization	4111-1111-1111-1111	tok_a8f3b2e1d9c7	Payment processing, reversible mapping
K-Anonymity	Age 34, ZIP 02139	Age 30-39, ZIP 021**	Analytics, research datasets
Differential Privacy	Salary: $95,000	Salary: $93,200 (± noise)	Aggregate statistics, ML training

Category	Techniques
Reversible	Tokenization, encryption
Irreversible	Hashing, generalization
Statistical	Differential privacy, noise
Structural	K-anonymity, l-diversity
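Differential privacy is usually applied to aggregates rather than to individual rows. As a minimal sketch, assuming the classic Laplace mechanism, the code below adds noise to a count and to a clamped sum. The function names (`sampleLaplace`, `privateCount`, `privateSum`) and the injectable `rng` parameter are illustrative assumptions, not part of any library. Naive floating-point Laplace sampling has known weaknesses, so treat this as a teaching aid and prefer a vetted differential privacy library in production.

```typescript
// Draws one sample from Laplace(0, scale) using inverse-CDF sampling.
// `rng` must return a uniform value in [0, 1); it is injectable so tests can be deterministic.
function sampleLaplace(scale: number, rng: () => number = Math.random): number {
  // Nudge 0 up to a tiny positive value so the logarithm below stays finite.
  const u = Math.max(rng(), Number.EPSILON) - 0.5; // uniform in (-0.5, 0.5)
  return -scale * Math.sign(u) * Math.log(1 - 2 * Math.abs(u));
}

// Noisy count. Adding or removing one person changes a count by at most 1,
// so the sensitivity is 1 and the noise scale is 1 / epsilon.
function privateCount(
  trueCount: number,
  epsilon: number,
  rng: () => number = Math.random
): number {
  if (!(epsilon > 0)) throw new RangeError('epsilon must be positive');
  return trueCount + sampleLaplace(1 / epsilon, rng);
}

// Noisy sum. Each value is clamped to [0, upperBound] first, so one person can
// change the sum by at most upperBound; the noise scale is upperBound / epsilon.
function privateSum(
  values: number[],
  upperBound: number,
  epsilon: number,
  rng: () => number = Math.random
): number {
  if (!(epsilon > 0)) throw new RangeError('epsilon must be positive');
  const clampedTotal = values
    .map(v => Math.min(Math.max(v, 0), upperBound))
    .reduce((sum, v) => sum + v, 0);
  return clampedTotal + sampleLaplace(upperBound / epsilon, rng);
}
```

A smaller `epsilon` means more noise and stronger privacy. Every released statistic consumes part of the privacy budget, so repeated queries over the same data need their epsilons added up.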

What is the key difference between data masking and encryption?

2. Real-World Scenario

The Scenario: You're reviewing a fintech application that processes loan applications. The development team needs production-like data for testing, the analytics team needs customer data for reporting, and a third-party ML vendor needs training data. All three requests involve PII.

❌ Vulnerable: Common Data Masking Failures

// --- Test Data Seeding Script ---
async function seedTestDatabase() {
  // ❌ Copies production data directly to test environment
  const prodUsers = await prodDb.query('SELECT * FROM users');
  for (const user of prodUsers) {
    await testDb.query(
      'INSERT INTO users VALUES ($1, $2, $3, $4, $5)',
      [user.id, user.name, user.email, user.ssn, user.credit_score]
    );
  }
  console.log(`Seeded ${prodUsers.length} real users to test DB`);
}

// --- Analytics Export Endpoint ---
app.get('/api/analytics/export', adminOnly, async (req, res) => {
  const data = await db.query(`
    SELECT name, email, date_of_birth, zip_code, loan_amount,
           income, credit_score, employment_status
    FROM loan_applications
    WHERE created_at > $1
  `, [req.query.since]);

  // ❌ Full PII sent to analytics: "we'll mask it in the BI tool"
  res.csv(data);
});

// --- ML Training Data Export ---
app.get('/api/ml/training-data', apiKeyAuth, async (req, res) => {
  const applications = await db.query(
    'SELECT * FROM loan_applications LIMIT 100000'
  );

  // ❌ "Masking" by removing the name field but keeping everything else
  const masked = applications.map(application => {
    const { name, ...rest } = application;
    return rest;
  });

  // ❌ email + DOB + zip code = re-identification
  res.json(masked);
});

// --- Log Masking Attempt ---
function maskSensitiveData(data: any) {
  // ❌ Only masks exact field names and misses variations such as:
  //    socialSecurityNumber, SSN, social_security,
  //    pass, passwd, pwd, secret, creditCard, etc.
  if (data.ssn) data.ssn = '***-**-****';
  if (data.password) data.password = '[REDACTED]';
  return data;
}

Five Critical Failures

This code contains five failures:

1) Production data is cloned to the test environment without masking, so a single test DB breach exposes every real customer.
2) PII is exported raw for analytics with the promise of "masking later", but the data is already exposed the moment it leaves the secure boundary.
3) Naive masking removes names but keeps a direct identifier (email) and quasi-identifiers (DOB, zip code), which together allow re-identification.
4) Field matching is incomplete: only exact field names are caught, so common variations slip through.
5) There are no formal anonymization guarantees: no k-anonymity, no differential privacy, and no verification that the output cannot be linked back to individuals.
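The incomplete field matching in the log masker can be reduced by normalizing key names before comparing them and by walking nested structures. The following is a minimal sketch under those assumptions; `SENSITIVE_KEYS`, `normalizeKey`, and `redactDeep` are hypothetical names, and the key list is illustrative rather than exhaustive. It does not handle cyclic references.

```typescript
// Normalized (lowercase, separator-free) key names treated as sensitive.
const SENSITIVE_KEYS = new Set([
  'ssn', 'socialsecurity', 'socialsecuritynumber',
  'password', 'passwd', 'pwd', 'pass', 'secret',
  'creditcard', 'cardnumber', 'ccnumber', 'apikey', 'token',
]);

// "social_security", "socialSecurity", and "Social-Security" all become "socialsecurity".
function normalizeKey(key: string): string {
  return key.toLowerCase().replace(/[^a-z0-9]/g, '');
}

// True only for object literals, so Dates, Buffers, and class instances are left alone.
function isPlainObject(value: unknown): value is Record<string, unknown> {
  if (value === null || typeof value !== 'object') return false;
  const proto = Object.getPrototypeOf(value);
  return proto === Object.prototype || proto === null;
}

// Returns a redacted deep copy; the caller's object is never mutated.
function redactDeep(value: unknown): unknown {
  if (Array.isArray(value)) return value.map(redactDeep);
  if (!isPlainObject(value)) return value;

  const out: Record<string, unknown> = {};
  for (const [key, inner] of Object.entries(value)) {
    out[key] = SENSITIVE_KEYS.has(normalizeKey(key)) ? '[REDACTED]' : redactDeep(inner);
  }
  return out;
}
```

A name-based deny-list like this is still only a safety net for logs. For API responses and exports, an allow-list of known-safe fields is the stronger default.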

An analytics export removes the 'name' field but includes email, date_of_birth, and zip_code. Is this data properly anonymized?

3. Data Masking Techniques

Data masking replaces sensitive values with realistic but fake alternatives. There are two primary modes: static masking (applied once to create a sanitized copy of the data) and dynamic masking (applied at query time based on the requesting user's role).

Masking Techniques Comparison

Technique	How It Works	Reversible?	Best For
Substitution	Replace real values with realistic fake data from a lookup table	No	Names, addresses, emails in test environments
Shuffling	Randomly swap values between records within the same column	No (without the shuffle key)	Preserving statistical distribution while breaking identity links
Character Masking	Replace characters with a fixed symbol (e.g., * or X)	No	Display masking: SSN, credit cards, phone numbers in UI
Numeric Variance	Add random noise within a defined range (±5%, ±10)	No	Salaries, ages, financial figures for analytics
Date Shifting	Shift dates by a consistent random offset per record	No (without the offset)	Preserving time intervals while hiding actual dates
Nulling Out	Replace sensitive values with NULL or empty string	No	Fields not needed in the target environment at all
Format-Preserving Masking	Replace value with a fake that matches the same format/regex	Depends on method	Credit cards (must pass Luhn check), phone numbers, IDs
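The format-preserving row mentions that fake card numbers must pass the Luhn check. As a rough sketch, the code below computes a Luhn check digit and uses it to generate a random number of the right shape. `luhnCheckDigit`, `isLuhnValid`, and `fakeCardNumber` are hypothetical helper names. Randomly generated digits can coincide with a real card number, so a real test-data pipeline would typically restrict the prefix to ranges reserved for testing.

```typescript
// Computes the Luhn check digit for a payload of decimal digits.
function luhnCheckDigit(payload: string): number {
  let sum = 0;
  // Walk right to left. The rightmost payload digit is doubled first because
  // the check digit itself will occupy the final (undoubled) position.
  for (let i = 0; i < payload.length; i++) {
    let digit = Number(payload[payload.length - 1 - i]);
    if (i % 2 === 0) {
      digit *= 2;
      if (digit > 9) digit -= 9;
    }
    sum += digit;
  }
  return (10 - (sum % 10)) % 10;
}

// Validates a card-like number, ignoring separators such as "-" or spaces.
function isLuhnValid(cardNumber: string): boolean {
  const digits = cardNumber.replace(/\D/g, '');
  if (digits.length < 2) return false;
  return luhnCheckDigit(digits.slice(0, -1)) === Number(digits.slice(-1));
}

// Generates a random number of the given length whose last digit is a valid Luhn check digit.
function fakeCardNumber(length = 16, rng: () => number = Math.random): string {
  let payload = '';
  for (let i = 0; i < length - 1; i++) {
    payload += String(Math.floor(rng() * 10));
  }
  return payload + String(luhnCheckDigit(payload));
}
```

Because the output has no relationship to any input value, this is substitution rather than a reversible transformation.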

❌ Vulnerable: Naive Masking Implementations

import crypto from 'crypto';

// ❌ Pattern 1: Incomplete character masking
function maskEmail(email: string): string {
  // Only masks the local part, so the domain still reveals the company
  return email.replace(/^.+@/, '****@');
  // "john.doe@secretstartup.com" → "****@secretstartup.com"
  // ❌ The domain is still identifying!
}

// ❌ Pattern 2: Deterministic masking without salt
function maskSSN(ssn: string): string {
  // Same input always produces the same output
  const hash = crypto.createHash('md5').update(ssn).digest('hex');
  return hash.slice(0, 3) + '-' + hash.slice(3, 5) + '-' + hash.slice(5, 9);
  // ❌ Attacker can pre-compute all 1 billion SSNs and reverse the mapping
}

// ❌ Pattern 3: Preserving too much structure
function maskPhoneNumber(phone: string): string {
  // Keeps area code, which reveals geographic location
  return phone.slice(0, 3) + '-***-****';
  // "415-555-1234" → "415-***-****"
  // ❌ Area code narrows down to San Francisco
}

// ❌ Pattern 4: Regex-only masking for logs
function maskLogEntry(log: string): string {
  // Only catches one SSN format and misses "555 12 3456" or "555123456"
  return log.replace(/\d{3}-\d{2}-\d{4}/g, '***-**-****');
}

✅ Secure: Proper Masking Implementations

import crypto from 'crypto';

// ✅ Secure email masking: masks both local and domain parts
function maskEmail(email: string): string {
  const [local, domain] = email.split('@');
  if (!local || !domain) return '***@***.***';
  const maskedLocal = local[0] + '***';
  const domainParts = domain.split('.');
  if (domainParts.length < 2 || !domainParts[0]) return maskedLocal + '@***.***';
  // Keep only the first character and the final label, so subdomains are not leaked
  const maskedDomain = domainParts[0][0] + '***.' + domainParts[domainParts.length - 1];
  return maskedLocal + '@' + maskedDomain;
  // "john.doe@secretstartup.com" → "j***@s***.com"
}

// ✅ Format-preserving masking with HMAC (keyed, not reversible without key)
function maskSSN(ssn: string, maskingKey: string): string {
  const digits = ssn.replace(/\D/g, ''); // "555-12-3456" and "555123456" mask identically
  const hmac = crypto.createHmac('sha256', maskingKey).update(digits).digest('hex');
  // 12 hex chars = 48 bits, safely below Number.MAX_SAFE_INTEGER; reduce to 9 decimal digits
  const fake = (parseInt(hmac.slice(0, 12), 16) % 1_000_000_000).toString().padStart(9, '0');
  return fake.slice(0, 3) + '-' + fake.slice(3, 5) + '-' + fake.slice(5);
  // ✅ Without the maskingKey, pre-computation attacks are infeasible
}

// ✅ Phone masking: removes area code, keeps last 4 for usability
function maskPhoneNumber(phone: string): string {
  const digits = phone.replace(/\D/g, '');
  if (digits.length < 4) return '***-***-****';
  return '***-***-' + digits.slice(-4);
}

// ✅ Comprehensive PII masking for objects
interface MaskingConfig {
  key: string;
  // Optional per-field overrides, keyed by exact field name; called with (value, fieldName)
  rules: Record<string, (value: string, key: string) => string>;
}

function createMasker(config: MaskingConfig) {
  // [_-]? plus the i flag lets each pattern match snake_case, kebab-case, and camelCase names
  const fieldPatterns = new Map<RegExp, (value: string, key: string) => string>([
    [/^(e[_-]?mail([_-]?address)?)$/i, (v) => maskEmail(v)],
    [/^(ssn|social[_-]?security([_-]?number)?)$/i, (v) => maskSSN(v, config.key)],
    [/^(phone|telephone|mobile|cell|phone[_-]?number)$/i, (v) => maskPhoneNumber(v)],
    [/^(name|first[_-]?name|last[_-]?name|full[_-]?name)$/i, () => '[MASKED]'],
    [/^(password|passwd|pwd|secret|token|api[_-]?key)$/i, () => '[REDACTED]'],
    [/^(card[_-]?number|credit[_-]?card|cc[_-]?number)$/i, (v) => '****-****-****-' + v.slice(-4)],
    [/^(address|street|street[_-]?address)$/i, () => '[MASKED ADDRESS]'],
    // Assumes ISO dates (YYYY-MM-DD): keeps the year only
    [/^(date[_-]?of[_-]?birth|dob|birth[_-]?date)$/i, (v) => v.slice(0, 4) + '-**-**'],
  ]);

  // Note: masks top-level string fields only; nested objects are not traversed
  return function maskObject<T extends Record<string, unknown>>(obj: T): T {
    const masked = { ...obj };
    for (const [field, value] of Object.entries(masked)) {
      if (typeof value !== 'string') continue;
      const customRule = config.rules[field];
      if (customRule) {
        (masked as Record<string, unknown>)[field] = customRule(value, field);
        continue;
      }
      for (const [pattern, maskFn] of fieldPatterns) {
        if (pattern.test(field)) {
          (masked as Record<string, unknown>)[field] = maskFn(value, field);
          break;
        }
      }
    }
    return masked;
  };
}

// Usage:
// const masker = createMasker({ key: process.env.MASKING_KEY!, rules: {} });
// const safeUser = masker(user);
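Field-based masking does not help when PII is embedded in free-text log lines, which is the gap Pattern 4 pointed at. One way to widen the net, sketched here with a hypothetical `scrubLogLine` helper, is to accept optional separators in the SSN pattern and to scrub email addresses as well. The patterns are assumptions chosen for illustration: the SSN pattern also matches any other standalone 9-digit number, so expect false positives.

```typescript
// Each entry pairs a replacement label with a global pattern for one kind of PII.
const TEXT_PATTERNS: Array<{ label: string; pattern: RegExp }> = [
  // Matches "555-12-3456", "555 12 3456", and "555123456", but not longer digit runs.
  { label: 'SSN', pattern: /\b\d{3}[- ]?\d{2}[- ]?\d{4}\b/g },
  // Deliberately loose email pattern; it favors over-matching to under-matching.
  { label: 'EMAIL', pattern: /[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}/g },
];

// Applies every pattern in turn and replaces matches with a bracketed label.
function scrubLogLine(line: string): string {
  return TEXT_PATTERNS.reduce(
    (text, { label, pattern }) => text.replace(pattern, `[${label}]`),
    line
  );
}
```

Pattern scrubbing is a last line of defense. It is more reliable to log structured objects that have already been masked than to clean up strings afterwards.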

A developer masks SSNs by hashing them with MD5 (no salt/key). Why is this insecure?

4. Tokenization

Tokenization replaces sensitive data with a non-sensitive surrogate (a "token") that maps back to the original through a secure token vault. Unlike encryption, tokens have no mathematical relationship to the original data — they cannot be reversed without access to the vault. Tokenization is the industry standard for PCI DSS compliance (credit card processing) and is increasingly used for PII protection.

Tokenization vs. Encryption vs. Masking

Property	Tokenization	Encryption	Masking
Reversible?	Yes (with vault access)	Yes (with key)	Usually no
Original stored where?	Secure token vault (separate system)	In place (as ciphertext)	Nowhere (destroyed)
Output format	Configurable (same length, same format)	Different format/length	Same or different format
Performance	Vault lookup per detokenization	CPU-bound crypto operation	Simple string transformation
PCI DSS scope	Token is out of scope	Ciphertext is in scope	Masked value is out of scope
Best for	Payment processing, PII that needs occasional retrieval	Data at rest that is frequently read	Test data, logs, analytics where original is never needed

❌ Vulnerable: Broken Tokenization

import crypto from 'crypto';

// ❌ Pattern 1: "Tokenization" that's just encoding
function tokenize(creditCard: string): string {
  // ❌ Base64 is NOT tokenization: it is trivially reversible
  return Buffer.from(creditCard).toString('base64');
}

// ❌ Pattern 2: Token derived from original data
function tokenizeSSN(ssn: string): string {
  // ❌ Deterministic hash: same SSN always maps to same token
  // Enables frequency analysis and cross-dataset correlation
  return 'tok_' + crypto.createHash('sha256').update(ssn).digest('hex').slice(0, 16);
}

// ❌ Pattern 3: Token vault with no access controls
const tokenVault = new Map<string, string>(); // ❌ In-memory, no encryption
function storeToken(token: string, original: string) {
  tokenVault.set(token, original); // ❌ Anyone with memory access can dump the vault
}

// ❌ Pattern 4: Detokenization without audit logging
app.get('/api/detokenize/:token', adminOnly, async (req, res) => {
  const original = await vault.detokenize(req.params.token);
  // ❌ No audit log of who accessed what and why
  res.json({ value: original });
});

✅ Secure: Proper Tokenization Implementation

import crypto from 'crypto';

// `Database` is the application's own query wrapper; here a single-row lookup
// is assumed to resolve to the row object, or to undefined when nothing matches.
interface TokenVaultConfig {
  encryptionKey: Buffer; // 32 bytes for AES-256
  db: Database;
}

class TokenVault {
  private config: TokenVaultConfig;

  constructor(config: TokenVaultConfig) {
    this.config = config;
  }

  // ✅ Generate a cryptographically random token with no relation to original
  async tokenize(originalValue: string, dataType: string): Promise<string> {
    const token = 'tok_' + crypto.randomBytes(16).toString('hex');

    // ✅ Encrypt the original before storing in the vault
    const iv = crypto.randomBytes(12); // 96-bit IV, the recommended size for GCM
    const cipher = crypto.createCipheriv('aes-256-gcm', this.config.encryptionKey, iv);
    let encrypted = cipher.update(originalValue, 'utf8', 'hex');
    encrypted += cipher.final('hex');
    const authTag = cipher.getAuthTag().toString('hex');

    // ✅ Store encrypted original in isolated vault database
    await this.config.db.query(
      `INSERT INTO token_vault (token, encrypted_value, iv, auth_tag, data_type, created_at)
       VALUES ($1, $2, $3, $4, $5, NOW())`,
      [token, encrypted, iv.toString('hex'), authTag, dataType]
    );

    return token;
  }

  // ✅ Detokenize with mandatory audit logging.
  //    Callers must authorize the request before invoking this method.
  async detokenize(
    token: string,
    requestContext: { userId: string; reason: string; ipAddress: string; ticketId?: string }
  ): Promise<string> {
    const record = await this.config.db.query(
      'SELECT encrypted_value, iv, auth_tag FROM token_vault WHERE token = $1',
      [token]
    );

    if (!record) throw new Error('Token not found');

    // ✅ Audit log every detokenization (written before the value is released)
    await this.config.db.query(
      `INSERT INTO detokenization_audit_log
       (token, user_id, reason, ticket_id, timestamp, ip_address)
       VALUES ($1, $2, $3, $4, NOW(), $5)`,
      [token, requestContext.userId, requestContext.reason,
       requestContext.ticketId ?? null, requestContext.ipAddress]
    );

    // ✅ Decrypt the original value; final() throws if the auth tag does not verify
    const iv = Buffer.from(record.iv, 'hex');
    const decipher = crypto.createDecipheriv('aes-256-gcm', this.config.encryptionKey, iv);
    decipher.setAuthTag(Buffer.from(record.auth_tag, 'hex'));
    let decrypted = decipher.update(record.encrypted_value, 'hex', 'utf8');
    decrypted += decipher.final('utf8');

    return decrypted;
  }

  // ✅ Batch tokenization for data pipelines (returns original value → token)
  async tokenizeBatch(
    records: Array<{ field: string; value: string }>,
    dataType: string
  ): Promise<Map<string, string>> {
    const tokenMap = new Map<string, string>();
    for (const record of records) {
      const token = await this.tokenize(record.value, dataType);
      tokenMap.set(record.value, token);
    }
    return tokenMap;
  }
}
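The `detokenize` method above records who asked for a value, but it does not decide whether that caller should be allowed to see it. A deny-by-default check in front of the vault could look like the sketch below. The role names, the `DataType` union, and `canDetokenize` are all hypothetical and would come from your own authorization model.

```typescript
type DataType = 'card_number' | 'ssn' | 'email';

// Which roles may recover which kinds of original values.
// A Map is used instead of a plain object so that inherited keys such as
// "constructor" can never be mistaken for a configured role.
const DETOKENIZE_POLICY = new Map<string, ReadonlySet<DataType>>([
  ['payments_service', new Set<DataType>(['card_number'])],
  ['compliance_officer', new Set<DataType>(['card_number', 'ssn', 'email'])],
]);

// Deny by default: unknown roles and unlisted data types are refused.
function canDetokenize(role: string, dataType: DataType): boolean {
  return DETOKENIZE_POLICY.get(role)?.has(dataType) ?? false;
}
```

A caller would run this check first and only then invoke the vault, so that refused attempts can be logged separately from successful reads.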

A payment system 'tokenizes' credit card numbers by Base64-encoding them. Why is this not real tokenization?

5. Anonymization & Pseudonymization

Pseudonymization replaces identifying fields with artificial identifiers — it's reversible with a mapping table, so the data is still considered personal data under GDPR. Anonymization is irreversible: no one, including the data holder, can re-identify the individuals. GDPR does not apply to truly anonymized data. The distinction has massive regulatory implications.

Pseudonymization vs. Anonymization

Property	Pseudonymization	Anonymization
Reversible?	Yes (with mapping/key)	No — designed to be irreversible
GDPR applies?	Yes — still personal data	No — falls outside GDPR scope
Direct identifiers	Replaced with pseudonyms	Removed or destroyed
Quasi-identifiers	Often left intact	Generalized, suppressed, or perturbed
Use cases	Internal analytics, research with re-linking capability	Public datasets, third-party sharing, open data
Regulatory status	Named in GDPR Art. 25 and Art. 32 as an example safeguard; the data stays in scope	Outside GDPR scope (Recital 26)

❌ Vulnerable: Fake "Anonymization"

// ❌ Pattern 1: Pseudonymization labeled as anonymization
function anonymize(user: User) {
  return {
    id: crypto.randomUUID(),         // New random ID
    name: '[ANONYMIZED]',            // Removed
    email: '[ANONYMIZED]',           // Removed
    // ❌ These quasi-identifiers enable re-identification:
    dateOfBirth: user.dateOfBirth,   // ❌ Exact DOB kept
    zipCode: user.zipCode,           // ❌ Full zip kept
    gender: user.gender,             // ❌ Gender kept
    // DOB + ZIP + gender uniquely identifies an estimated 87% of the U.S. population
    joinDate: user.joinDate,         // ❌ Exact join date kept
    purchaseHistory: user.purchases, // ❌ Behavioral data is identifying
  };
}

// ❌ Pattern 2: Consistent pseudonyms across datasets
function pseudonymizeForExport(userId: string): string {
  // Same user always gets the same pseudonym across ALL exports
  return crypto.createHash('sha256').update(userId + 'static_salt').digest('hex');
  // ❌ Enables cross-dataset linking: if an attacker has one dataset
  //    with known users, they can link it to the "anonymous" dataset
}

// ❌ Pattern 3: Insufficient suppression threshold
function anonymizeSmallGroup(records: any[]) {
  // Only 2 people aged 95+ in zip code 02139
  // ❌ No k-anonymity check: groups of 1-2 are trivially identifiable
  return records.map(r => ({
    ageRange: Math.floor(r.age / 5) * 5 + '-' + (Math.floor(r.age / 5) * 5 + 4),
    zipCode: r.zipCode,  // ❌ Full zip kept
    condition: r.medicalCondition,
  }));
}
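Pattern 2 fails because one static salt makes every export linkable to every other. A possible alternative, sketched below with a hypothetical `createExportPseudonymizer` factory, derives pseudonyms with an HMAC under a key that is generated per export. Records stay joinable inside one export but cannot be matched across exports without both keys. The result is still pseudonymization: while the key or a mapping exists, and while quasi-identifiers remain in the rows, the data is personal data.

```typescript
import { createHmac, randomBytes } from 'node:crypto';

// Returns a function that maps user IDs to pseudonyms under one export-specific key.
// By default a fresh random key is generated, so two exports never share pseudonyms.
function createExportPseudonymizer(exportKey: Buffer = randomBytes(32)) {
  return (userId: string): string =>
    'p_' + createHmac('sha256', exportKey).update(userId).digest('hex').slice(0, 32);
}

// Usage sketch: one pseudonymizer per export job.
const forVendorExport = createExportPseudonymizer();
const forInternalReport = createExportPseudonymizer();

// Same user, same export: stable. Same user, different export: unrelated values.
const stableWithinExport = forVendorExport('user-42') === forVendorExport('user-42');
const linkableAcrossExports = forVendorExport('user-42') === forInternalReport('user-42');
```

Whether the export key is stored (to allow re-linking later) or discarded after the job is a policy decision that should be documented alongside the export.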

✅ Secure: Proper Anonymization with k-Anonymity

interface AnonymizationConfig {
  kThreshold: number;         // Minimum group size (typically k >= 5)
  quasiIdentifiers: string[]; // Fields that could enable re-identification
  sensitiveFields: string[];  // Fields to protect (the "payload")
  suppressionLimit: number;   // Max fraction of records to suppress (0.05 = 5%)
}

function anonymizeDataset<T extends Record<string, unknown>>(
  records: T[],
  config: AnonymizationConfig
): T[] {
  let anonymized = records.map(record => {
    const result = { ...record };

    // ✅ Remove all direct identifiers
    delete result.name;
    delete result.email;
    delete result.ssn;
    delete result.phone;
    delete result.address;

    // ✅ Generalize quasi-identifiers
    if ('dateOfBirth' in result && typeof result.dateOfBirth === 'string') {
      // Use the UTC year: date-only ISO strings parse as UTC, so the local
      // year can be off by one in time zones behind UTC
      const year = new Date(result.dateOfBirth).getUTCFullYear();
      const decade = Math.floor(year / 10) * 10;
      (result as Record<string, unknown>).dateOfBirth = `${decade}s`;
    }

    if ('zipCode' in result && typeof result.zipCode === 'string') {
      (result as Record<string, unknown>).zipCode =
        (result.zipCode as string).slice(0, 3) + '**';
    }

    if ('age' in result && typeof result.age === 'number') {
      const bucket = Math.floor(result.age / 10) * 10;
      (result as Record<string, unknown>).age = `${bucket}-${bucket + 9}`;
    }

    return result;
  });

  // ✅ Verify k-anonymity: every combination of quasi-identifiers
  //    must appear in at least k records
  const groups = new Map<string, T[]>();
  for (const record of anonymized) {
    const key = config.quasiIdentifiers
      .map(qi => String(record[qi] ?? ''))
      .join('|');
    if (!groups.has(key)) groups.set(key, []);
    groups.get(key)!.push(record);
  }

  // ✅ Count the records in groups smaller than k (these will be suppressed)
  const suppressedCount = [...groups.values()]
    .filter(g => g.length < config.kThreshold)
    .reduce((sum, g) => sum + g.length, 0);

  if (suppressedCount / records.length > config.suppressionLimit) {
    throw new Error(
      'Suppression would remove ' +
      (suppressedCount / records.length * 100).toFixed(1) +
      '% of records (limit: ' +
      (config.suppressionLimit * 100) +
      '%). Increase generalization before exporting.'
    );
  }

  // ✅ Suppress groups smaller than k (remove them entirely)
  anonymized = anonymized.filter(record => {
    const key = config.quasiIdentifiers
      .map(qi => String(record[qi] ?? ''))
      .join('|');
    return (groups.get(key)?.length ?? 0) >= config.kThreshold;
  });

  return anonymized;
}

// Usage:
// const safeData = anonymizeDataset(patients, {
//   kThreshold: 5,
//   quasiIdentifiers: ['dateOfBirth', 'zipCode', 'gender'],
//   sensitiveFields: ['diagnosis', 'treatment'],
//   suppressionLimit: 0.05,
// });
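k-anonymity alone does not stop attribute disclosure: if all five people in a group share the same diagnosis, knowing that someone is in the group reveals the diagnosis. l-diversity addresses this by requiring at least l distinct sensitive values per group. The check below is a minimal sketch of the simplest ("distinct") form of l-diversity, using a hypothetical `findLDiversityViolations` helper that mirrors the grouping logic of the k-anonymity code.

```typescript
type Row = Record<string, unknown>;

// Returns the quasi-identifier keys of every group that contains fewer than
// `l` distinct values of `sensitiveField`. An empty result means the check passed.
function findLDiversityViolations(
  rows: Row[],
  quasiIdentifiers: string[],
  sensitiveField: string,
  l: number
): string[] {
  const distinctByGroup = new Map<string, Set<string>>();

  for (const row of rows) {
    const groupKey = quasiIdentifiers.map(qi => String(row[qi] ?? '')).join('|');
    if (!distinctByGroup.has(groupKey)) distinctByGroup.set(groupKey, new Set());
    distinctByGroup.get(groupKey)!.add(String(row[sensitiveField] ?? ''));
  }

  return Array.from(distinctByGroup.entries())
    .filter(([, values]) => values.size < l)
    .map(([groupKey]) => groupKey);
}
```

Groups that fail can be handled the same way as groups that fail k-anonymity: generalize the quasi-identifiers further, or suppress the group.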

A dataset has been 'anonymized' by replacing names with random IDs but keeping exact date of birth, full zip code, and gender. Under GDPR, is this anonymization or pseudonymization?

6. Implementation Patterns

Masking must be applied consistently at every boundary where data leaves the secure perimeter: API responses, log output, error messages, analytics pipelines, data exports, and non-production environments. A centralized masking layer prevents the common failure mode of masking in some places but not others.

✅ Centralized Masking Middleware for APIs

import type { Request, Response, NextFunction } from 'express';

type SensitivityLevel = 'public' | 'internal' | 'confidential' | 'restricted';

interface FieldPolicy {
  sensitivity: SensitivityLevel;
  maskFn: (value: unknown) => unknown;
}

// ✅ Define masking policies per field at the schema level.
// passwordHash is deliberately NOT listed: unlisted fields are dropped for every
// requester, whereas a listed field is returned unmasked to sufficient clearance.
// maskEmail and maskPhoneNumber are the helpers from Section 3.
const USER_FIELD_POLICIES: Record<string, FieldPolicy> = {
  id:           { sensitivity: 'internal', maskFn: v => v },
  displayName:  { sensitivity: 'public',   maskFn: v => v },
  email:        { sensitivity: 'confidential', maskFn: v => maskEmail(v as string) },
  phone:        { sensitivity: 'confidential', maskFn: v => maskPhoneNumber(v as string) },
  ssn:          { sensitivity: 'restricted',   maskFn: () => '[RESTRICTED]' },
  dateOfBirth:  { sensitivity: 'confidential', maskFn: v => (v as string).slice(0, 4) + '-**-**' },
  salary:       { sensitivity: 'restricted',   maskFn: () => '[RESTRICTED]' },
  role:         { sensitivity: 'internal',     maskFn: v => v },
};

// ✅ Apply masking based on requester's clearance level
function applyMasking(
  data: Record<string, unknown>,
  policies: Record<string, FieldPolicy>,
  requesterClearance: SensitivityLevel
): Record<string, unknown> {
  const clearanceLevels: Record<SensitivityLevel, number> = {
    public: 0, internal: 1, confidential: 2, restricted: 3,
  };

  const result: Record<string, unknown> = {};

  for (const [field, value] of Object.entries(data)) {
    const policy = policies[field];
    if (!policy) continue; // ✅ Unlisted fields are excluded by default

    if (clearanceLevels[requesterClearance] >= clearanceLevels[policy.sensitivity]) {
      result[field] = value; // Requester has clearance: return original
    } else {
      const masked = policy.maskFn(value);
      if (masked !== undefined) result[field] = masked; // Apply masking
    }
  }

  return result;
}

// ✅ API middleware that applies masking automatically.
// Assumes earlier auth middleware set req.user and that the application
// provides getUserClearance(user): SensitivityLevel.
function maskingMiddleware(policies: Record<string, FieldPolicy>) {
  return (req: Request, res: Response, next: NextFunction) => {
    const originalJson = res.json.bind(res);
    const clearance = getUserClearance(req.user);

    res.json = (body: unknown) => {
      if (Array.isArray(body)) {
        return originalJson(
          body.map(item =>
            typeof item === 'object' && item !== null
              ? applyMasking(item as Record<string, unknown>, policies, clearance)
              : item
          )
        );
      }
      if (typeof body === 'object' && body !== null) {
        return originalJson(
          applyMasking(body as Record<string, unknown>, policies, clearance)
        );
      }
      return originalJson(body);
    };
    next();
  };
}

✅ Static Masking for Test Environments

import { faker } from '@faker-js/faker';

interface StaticMaskingRule {
  column: string;
  generator: () => string | number;
}

// ✅ Generate realistic but fake data for each field
const maskingRules: Record<string, StaticMaskingRule[]> = {
  users: [
    { column: 'first_name', generator: () => faker.person.firstName() },
    { column: 'last_name',  generator: () => faker.person.lastName() },
    { column: 'email',      generator: () => faker.internet.email() },
    { column: 'phone',      generator: () => faker.phone.number() },
    // replaceSymbols swaps each '#' for a random digit
    { column: 'ssn',        generator: () => faker.helpers.replaceSymbols('###-##-####') },
    { column: 'date_of_birth', generator: () =>
      faker.date.between({ from: '1950-01-01', to: '2005-12-31' }).toISOString().slice(0, 10) },
    { column: 'address',    generator: () => faker.location.streetAddress() },
    { column: 'password_hash', generator: () => '$2b$12$invalidhashfortesting' },
  ],
  loan_applications: [
    { column: 'applicant_name', generator: () => faker.person.fullName() },
    { column: 'income',         generator: () =>
      Math.round(faker.number.int({ min: 30000, max: 200000 }) / 1000) * 1000 },
    { column: 'credit_score',   generator: () =>
      faker.number.int({ min: 300, max: 850 }) },
  ],
};

// ✅ Apply static masking to a full database export.
// `Database` is the application's own wrapper; query() and insert() are assumed helpers.
async function createMaskedTestDatabase(
  sourceDb: Database,
  targetDb: Database,
  rules: Record<string, StaticMaskingRule[]>
) {
  for (const [table, tableRules] of Object.entries(rules)) {
    // Table names come from the trusted rules config above, never from user input
    const rows = await sourceDb.query(`SELECT * FROM ${table}`);
    const columnMap = new Map(tableRules.map(r => [r.column, r.generator]));

    for (const row of rows) {
      // Columns without a rule are copied unchanged, so review each table's full column list
      const maskedRow = { ...row };
      for (const [col, generator] of columnMap) {
        if (col in maskedRow) maskedRow[col] = generator();
      }
      await targetDb.insert(table, maskedRow);
    }

    console.log(`Masked ${rows.length} rows in ${table}`);
  }
}

A masking middleware only masks fields that are explicitly listed in its policy. Unlisted fields pass through unchanged. What is the security risk?

7. Code Review Defenses

Data Masking Code Review Principles

1) Mask at the boundary: Apply masking where data exits the secure perimeter (API responses, exports, logs, test data).
2) Use allow-lists, not deny-lists: Only include explicitly declared fields in output, rather than trying to enumerate every sensitive field to block.
3) Classify at the schema level: Annotate fields with sensitivity levels in the data model, not in individual endpoints.
4) Test masking: Write automated tests that verify sensitive fields are properly masked in API responses.
5) Separate environments: Never use production data in non-production environments without static masking.
6) Audit detokenization: Every time original data is recovered from a token, log who, when, and why.
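To make the allow-list principle concrete, here is a small sketch of a serializer that copies only the fields it was told about. `pickAllowed` and the `PUBLIC_USER_FIELDS` constant are hypothetical names used for illustration. The point is that a column added to the table later stays out of the response until someone deliberately lists it.

```typescript
// Copies only the allow-listed keys from `record`; everything else is dropped.
function pickAllowed<T extends Record<string, unknown>, K extends keyof T>(
  record: T,
  allowed: readonly K[]
): Pick<T, K> {
  const out = {} as Pick<T, K>;
  for (const key of allowed) {
    if (key in record) out[key] = record[key];
  }
  return out;
}

// Example: the database row gained an `ssn` column, but the response shape did not.
const PUBLIC_USER_FIELDS = ['id', 'displayName'] as const;

const userRow = { id: 123, displayName: 'Ann', ssn: '555-12-3456', passwordHash: 'x' };
const publicUser = pickAllowed(userRow, PUBLIC_USER_FIELDS);
// publicUser contains only `id` and `displayName`
```

With a deny-list, the same schema change would have leaked `ssn` until someone remembered to block it.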

✅ Automated Masking Verification Tests

import request from 'supertest';

// `app`, `logger`, and `processUserRequest` are the application's own exports.
describe('API response masking', () => {
  it('should not expose restricted fields to regular users', async () => {
    const response = await request(app)
      .get('/api/users/123')
      .set('Authorization', 'Bearer regular_user_token');

    // ✅ Fields that must never appear in a response
    expect(response.body).not.toHaveProperty('passwordHash');
    expect(response.body).not.toHaveProperty('password_hash');
    expect(response.body).not.toHaveProperty('internalNotes');

    // ✅ Restricted fields must be absent or replaced by the placeholder
    for (const field of ['ssn', 'salary']) {
      if (field in response.body) {
        expect(response.body[field]).toBe('[RESTRICTED]');
      }
    }

    // ✅ Verify confidential fields are masked
    if (response.body.email) {
      expect(response.body.email).toMatch(/^.\*\*\*@.\*\*\*\..+$/);
    }
    if (response.body.phone) {
      expect(response.body.phone).toMatch(/^\*\*\*-\*\*\*-\d{4}$/);
    }
  });

  it('should mask PII in log output', () => {
    const logSpy = jest.spyOn(logger, 'info');

    // Trigger a log event
    processUserRequest({
      email: 'john@example.com',
      ssn: '123-45-6789',
      password: 'secret123',
    });

    const loggedData = logSpy.mock.calls[0][1];

    // ✅ Verify no raw PII in logs
    expect(JSON.stringify(loggedData)).not.toContain('john@example.com');
    expect(JSON.stringify(loggedData)).not.toContain('123-45-6789');
    expect(JSON.stringify(loggedData)).not.toContain('secret123');
  });

  it('should enforce k-anonymity on analytics exports', async () => {
    const response = await request(app)
      .get('/api/analytics/export?since=2025-01-01')
      .set('Authorization', 'Bearer analyst_token');

    const records = response.body.data;

    // ✅ Verify no direct identifiers
    for (const record of records) {
      expect(record).not.toHaveProperty('name');
      expect(record).not.toHaveProperty('email');
      expect(record).not.toHaveProperty('ssn');
      expect(record).not.toHaveProperty('phone');
    }

    // ✅ Verify quasi-identifiers are generalized
    for (const record of records) {
      if (record.zipCode) {
        expect(record.zipCode).toMatch(/^\d{3}\*\*$/);
      }
      if (record.dateOfBirth) {
        expect(record.dateOfBirth).toMatch(/^\d{4}s$/);
      }
    }

    // ✅ Verify k-anonymity (k >= 5)
    const groups = new Map<string, number>();
    for (const record of records) {
      const key = [record.dateOfBirth, record.zipCode, record.gender].join('|');
      groups.set(key, (groups.get(key) || 0) + 1);
    }
    for (const [, count] of groups) {
      expect(count).toBeGreaterThanOrEqual(5);
    }
  });
});

✅ CI Pipeline Masking Verification

// Assumed shapes for the schema registry and the audit output
interface SchemaField { name: string; masked: boolean; }
interface ResponseSchema { fields: SchemaField[]; }
interface AuditResult {
  endpoint: string;
  field: string;
  severity: 'critical';
  message: string;
}

// ✅ Pre-deploy check: scan API schemas for unmasked sensitive fields
function auditResponseSchemas(schemas: Record<string, ResponseSchema>): AuditResult[] {
  const sensitivePatterns = [
    /password/i, /passwd/i, /secret/i,
    /ssn/i, /social.?security/i,
    /credit.?card/i, /card.?number/i, /cvv/i,
    /api.?key/i, /private.?key/i, /token/i,
    /date.?of.?birth/i, /dob/i,
  ];

  const violations: AuditResult[] = [];

  for (const [endpoint, schema] of Object.entries(schemas)) {
    for (const field of schema.fields) {
      for (const pattern of sensitivePatterns) {
        if (pattern.test(field.name) && !field.masked) {
          violations.push({
            endpoint,
            field: field.name,
            severity: 'critical',
            message: `Potentially sensitive field '${field.name}' is not masked in ${endpoint}`,
          });
          break; // Report each field once, even if several patterns match
        }
      }
    }
  }

  return violations;
}

What is the most effective way to ensure data masking is consistently applied across all API endpoints?
