Parser Differentials Code Review Guide
Introduction
A parser differential is a bug in the seam between two components. Both components read the same bytes; neither is obviously wrong in isolation; but they disagree on what the bytes mean, and an attacker who knows the disagreement can route a security decision through component A while routing the actual operation through component B. Because the bug lives in the protocol interpretation rather than in any single line of code, it slips past unit tests, slips past code review that focuses on one service at a time, and slips past WAFs — WAFs are themselves a prime source of differentials.
Every famous bug class below is a parser differential in disguise: HTTP request smuggling (proxy vs origin disagree on message framing), SSRF via URL validation bypass (validator vs fetcher disagree on hostname), SAML signature wrapping (signer vs consumer disagree on which <Assertion> is "the" assertion), HTTP/2 downgrade smuggling (HTTP/2 layer vs HTTP/1 back-end disagree on length), JWT algorithm confusion in some libraries (header parser vs key resolver disagree on the algorithm). The common review discipline is: wherever two parsers sit in the same trust chain, assume they disagree somewhere, and make sure that disagreement cannot become an escalation.
Signed XML is not protected XML
The canonical sentence to internalise: valid signature means "the bytes the signer signed match the bytes the verifier canonicalised," not "the document the consumer ends up reading is what the signer intended." If the consumer walks to a different node than the verifier covered, the signature is valid and the content is attacker-controlled. Same pattern applies to JWT, JWS, XMLDSIG, COSE, and any other signed-envelope format.
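One way to keep the verifier and the consumer on the same node is to make verification return the parsed object, so no second parse of the raw bytes ever happens. The sketch below assumes an HMAC-signed JSON payload; `KEY`, `sign`, and `verify_and_parse` are hypothetical names for illustration, not part of any library discussed here.

```python
import hashlib
import hmac
import json

KEY = b"demo-key"  # hypothetical shared secret, for illustration only


def sign(payload: bytes) -> str:
    """Return a hex HMAC-SHA256 tag over the exact payload bytes."""
    return hmac.new(KEY, payload, hashlib.sha256).hexdigest()


def verify_and_parse(payload: bytes, tag: str) -> dict:
    """Verify the tag, then parse the same bytes exactly once.

    Callers receive the parsed object and never handle the raw bytes
    again, so no second parser can read them differently.
    """
    if not hmac.compare_digest(sign(payload), tag):
        raise ValueError("bad signature")
    return json.loads(payload)


payload = b'{"user": "alice", "role": "user"}'
claims = verify_and_parse(payload, sign(payload))
print(claims["role"])
```

The design choice is the return type: business logic only ever sees `claims`, never `payload`, which removes the opportunity to walk to a node the signature did not cover.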
One Byte Stream, Two Interpretations
    POST /api/transfer HTTP/1.1
    Host: bank.example
    Content-Length: 13
    Transfer-Encoding: chunked

    0

    GET /admin HTTP/1.1
    Host: bank.example

Front-end proxy:
- Reads Transfer-Encoding: chunked
- Stops at the 0\r\n\r\n terminator
- Authorizes the POST to /api/transfer
- Forwards the stream to the origin

Origin app:
- Reads Content-Length: 13
- Consumes 13 bytes of body, stops there
- Treats the remaining bytes as the next request
- Sees a smuggled GET /admin with no auth layer in front of it
The pattern: a parser differential is when two components in the same trust chain read the same bytes and reach different conclusions about what they mean. Security decisions are made on interpretation A, the actual operation runs against interpretation B.
The diagram above is the shape of every parser-differential bug, even ones that look nothing like HTTP smuggling on the surface. Replace "front-end proxy" with "validator," "WAF," "signature verifier," "audit logger," or "rate limiter." Replace "origin app" with "file system," "database," "HTTP client," or "authorization handler." The structural invariant — two parsers, one byte stream, divergent interpretations — is what matters.
A team reports a bug: the WAF blocks requests containing '../' but the origin still serves /etc/passwd when the path contains '..%2f'. Which framing is correct?
HTTP Request Smuggling
HTTP/1.1 has two ways to tell the parser where a request body ends: Content-Length (count bytes) and Transfer-Encoding: chunked (read chunks until the zero-length terminating chunk). RFC 7230, section 3.3.3, says that when a message contains both, Transfer-Encoding overrides Content-Length, and that such a message ought to be handled as an error because it may be a smuggling attempt. In practice, proxies, load balancers, CDNs, WAFs, and application servers each implement this slightly differently. The attack surface is the combinatorial disagreement between whatever two implementations sit next to each other in your trust chain.
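To make the disagreement concrete, here is a minimal sketch of two toy framers reading the same bytes. Both functions are illustrative stand-ins written for this example (they ignore trailers, chunk extensions, and error handling) and do not model any particular proxy or server.

```python
def split_by_content_length(stream: bytes) -> tuple[bytes, bytes]:
    """Frame one request using Content-Length; return (request, leftover)."""
    head, _, rest = stream.partition(b"\r\n\r\n")
    length = 0
    for line in head.split(b"\r\n")[1:]:  # skip the request line
        name, _, value = line.partition(b":")
        if name.strip().lower() == b"content-length":
            length = int(value.strip())
    return head + b"\r\n\r\n" + rest[:length], rest[length:]


def split_by_chunked(stream: bytes) -> tuple[bytes, bytes]:
    """Frame one request using chunked encoding; return (request, leftover)."""
    head, _, rest = stream.partition(b"\r\n\r\n")
    pos = 0
    while True:
        size_line_end = rest.index(b"\r\n", pos)
        size = int(rest[pos:size_line_end], 16)
        pos = size_line_end + 2 + size + 2  # size line, chunk data, trailing CRLF
        if size == 0:
            break
    return head + b"\r\n\r\n" + rest[:pos], rest[pos:]


smuggled = b"GET /admin HTTP/1.1\r\nHost: front-end.example\r\n\r\n"
body = b"0\r\n\r\n" + smuggled
stream = (
    b"POST / HTTP/1.1\r\nHost: front-end.example\r\n"
    b"Content-Length: %d\r\nTransfer-Encoding: chunked\r\n\r\n" % len(body)
) + body

_, leftover_cl = split_by_content_length(stream)
_, leftover_te = split_by_chunked(stream)
print(leftover_cl)  # b'' : the Content-Length framer sees one request
print(leftover_te)  # the chunked framer sees a second request left on the wire
```

The Content-Length framer consumes everything as one body, while the chunked framer stops at the zero chunk and leaves `GET /admin` as the start of a new request.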
Four Smuggling Flavors at a Glance
- CL.TE: front-end uses Content-Length; back-end uses Transfer-Encoding.
- TE.CL: front-end uses Transfer-Encoding; back-end falls back to Content-Length.
- TE.TE: both ends support Transfer-Encoding, but parse an obfuscated variant differently. Examples: Transfer-Encoding : chunked, Transfer-Encoding: xchunked, duplicate TE headers, tab vs space.
- H2.CL: the HTTP/1.1 back-end trusts a Content-Length the attacker injected into a pseudo-header.
CL.TE smuggling payload
    POST / HTTP/1.1
    Host: front-end.example
    Content-Length: 76
    Transfer-Encoding: chunked

    0

    GET /admin HTTP/1.1
    Host: front-end.example
    Content-Length: 10

    x=y

The front-end uses Content-Length: 76 (the whole body, counting CRLF line endings), so it treats everything after the headers as one request body and forwards it. The back-end uses Transfer-Encoding: chunked, reads the zero-length chunk (0\r\n\r\n, 5 bytes), finishes the first request, and starts parsing a second request from the leftover bytes: GET /admin HTTP/1.1. That GET was never seen by the front-end's auth layer, ACL, or WAF. It inherits the TCP connection's trust context. Because its own Content-Length: 10 is longer than the 3 body bytes supplied, the back-end keeps waiting, and on a socket that is reused for another user the next victim's request is spliced onto it, which is how smuggling escalates to request hijacking.
TE.CL: the mirror case
    POST / HTTP/1.1
    Host: front-end.example
    Content-Length: 4
    Transfer-Encoding: chunked

    47
    POST /admin HTTP/1.1
    Host: back-end.example
    Content-Length: 15

    x=1
    0

Here the front-end trusts Transfer-Encoding and reads the chunked body to completion: one chunk of 0x47 (71) bytes, counting CRLF line endings, that contains the smuggled POST, then the terminating zero chunk. The back-end trusts Content-Length: 4 and stops after 47\r\n, treating everything after it as the next request. TE.CL is usually harder to land than CL.TE because it requires the front-end to willingly accept chunked encoding, which well-hardened CDNs often normalise or reject.
HTTP/2 downgrade smuggling (H2.CL)
    # HTTP/2 request (to the front-end)
    :method POST
    :path /api
    :authority front.example
    content-length 0     # the H2 frame says the body is 0 bytes

    # Attacker-injected header field
    content-length 34    # but there is a shadow Content-Length

    # HTTP/2 body (rewritten by the front-end into an HTTP/1.1 request)
    POST /admin HTTP/1.1
    Host: front.example

    # Back-end sees HTTP/1.1 with Content-Length: 34 and happily parses a
    # second request out of the smuggled bytes.

H2.CL is the modern footgun
Many smuggling reports since 2021 involve HTTP/2-to-HTTP/1 downgrade bugs. The front-end speaks HTTP/2 to the client and translates to HTTP/1.1 at the back-end edge. Attacker-controlled header fields (such as a content-length that disagrees with the frame length, or one injected through a pseudo-header value) survive the translation and collide with the HTTP/2 frame length that the front-end already used to delimit the body. Fix: reject HTTP/2 requests with conflicting framing metadata, or run HTTP/2 end-to-end.
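The "reject conflicting framing metadata" fix can be sketched as a check that runs before the downgrade. This assumes the front-end exposes the decoded header list and the total DATA-frame payload length; `check_h2_framing` and its parameters are hypothetical names for illustration.

```python
def check_h2_framing(headers: list[tuple[str, str]], data_len: int) -> None:
    """Raise ValueError if the header list conflicts with the DATA frames.

    headers  -- decoded (name, value) pairs, pseudo-headers included
    data_len -- total payload bytes received in DATA frames
    """
    lengths = set()
    for name, value in headers:
        # CR, LF, or NUL in a name or value would turn into extra header
        # lines once the request is rewritten as HTTP/1.1 text.
        if any(c in name + value for c in "\r\n\0"):
            raise ValueError(f"control character in header {name!r}")
        if name.lower() == "transfer-encoding":
            raise ValueError("transfer-encoding is not valid in HTTP/2")
        if name.lower() == "content-length":
            lengths.add(value.strip())
    if len(lengths) > 1:
        raise ValueError("conflicting content-length values")
    if lengths:
        (declared,) = lengths
        if not (declared.isascii() and declared.isdigit()):
            raise ValueError("non-numeric content-length")
        if int(declared) != data_len:
            raise ValueError("content-length does not match DATA frames")


# A consistent request passes silently.
check_h2_framing([(":method", "POST"), ("content-length", "5")], data_len=5)
```

A request that declares `content-length: 0` while carrying 34 bytes of DATA, or one that hides a header line inside a `:path` value, raises before anything is forwarded.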
Mitigations, ranked by how well they actually work
| Mitigation | What it does | Effectiveness |
|---|---|---|
| HTTP/2 end-to-end | Eliminates the downgrade seam entirely — framing is unambiguous at both hops. | Best. Removes the bug class. |
| Reject ambiguous messages | Front-end rejects any request with both CL and TE, duplicate CL, TE with obfuscation, or TE on a non-chunked body. | Very good. Removes the most exploited variants. |
| Normalise then re-emit | Front-end parses the request to an internal representation and rewrites it on the back-end side — no byte-level passthrough. | Good if implemented consistently, but easy to bypass with novel input. |
| Close connection after each request | Prevents smuggled bytes from reaching the next user's connection. | Partial — blocks request hijacking but not same-request privilege escalation. |
| WAF signature rules | Match known payload shapes (e.g. <code>Transfer-Encoding : chunked</code>). | Weak. Attackers iterate faster than signatures. |
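The "reject ambiguous messages" row can be sketched as a strict header check at the front-end. This is a simplified illustration that assumes the raw header lines are available as bytes; `reject_ambiguous` is a hypothetical helper, and a production proxy would enforce more than this.

```python
def reject_ambiguous(raw_header_lines: list[bytes]) -> None:
    """Raise ValueError for HTTP/1.1 header sets with ambiguous framing."""
    content_lengths = []
    transfer_encodings = []
    for line in raw_header_lines:
        name, sep, value = line.partition(b":")
        if not sep or not name or name != name.strip():
            # Catches "Transfer-Encoding : chunked", leading tabs or spaces,
            # and folded continuation lines.
            raise ValueError(f"malformed header line: {line!r}")
        lname = name.lower()
        if lname == b"content-length":
            content_lengths.append(value.strip(b" \t"))
        elif lname == b"transfer-encoding":
            transfer_encodings.append(value.strip(b" \t").lower())
    if content_lengths and transfer_encodings:
        raise ValueError("both Content-Length and Transfer-Encoding")
    if len(content_lengths) > 1:
        raise ValueError("duplicate Content-Length")
    if content_lengths and not content_lengths[0].isdigit():
        raise ValueError("non-numeric Content-Length")
    if transfer_encodings and transfer_encodings != [b"chunked"]:
        raise ValueError("unsupported or obfuscated Transfer-Encoding")


# An unambiguous request passes silently.
reject_ambiguous([b"Host: bank.example", b"Content-Length: 13"])
```

Rejecting outright, rather than picking a winner, is the point: any rule for choosing between two framing headers is one more place for the back-end to choose differently.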
A reviewer notices the back-end framework tolerates duplicate Content-Length headers by taking the last one. The front-end CDN uses the first. Nothing else is unusual. Is this exploitable?
URL Parser Disagreements
A URL looks like a simple string with a predictable grammar. It is not. RFC 3986 (generic URIs), the WHATWG URL Living Standard (what browsers actually implement), and the many library implementations that predate both all make slightly different choices about how to handle userinfo, percent-encoding, backslashes, IDN, and fragments. Any application that validates a URL with one parser and fetches it with another is a candidate for SSRF.
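A small sketch shows how little it takes for two parsers to disagree on the host. Both functions below are toy parsers written for this example, not the behaviour of any real library: one leaves a backslash alone, the other treats it like a forward slash, as WHATWG-style parsers do for http URLs.

```python
def host_keep_backslash(url: str) -> str:
    """Toy parser A: the authority ends only at '/', '?' or '#'."""
    authority = url.split("://", 1)[1]
    for stop in "/?#":
        authority = authority.split(stop, 1)[0]
    # Everything before the last '@' is userinfo.
    return authority.rsplit("@", 1)[-1]


def host_backslash_as_slash(url: str) -> str:
    """Toy parser B: treats a backslash like '/' before splitting."""
    return host_keep_backslash(url.replace("\\", "/"))


url = "http://evil.example\\@internal.corp/"
print(host_keep_backslash(url))      # internal.corp
print(host_backslash_as_slash(url))  # evil.example
```

Neither answer is "the" correct one in isolation; the bug only exists when a validator built like one function guards a fetcher built like the other.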
The Same URL, Three Different Hosts
Input: http://evil.example\@internal.corp/
- A parser that does not special-case \ treats it as part of the userinfo. Host: internal.corp
- A parser that converts \ to / (for http:) ends the authority there. Host: evil.example
The validator sees evil.example, and allows the request. The HTTP client then parses the same URL with a different library, resolves internal.corp, and sends the request to the internal host. The validator and the fetcher must parse with the same rules, or, better, the validator must pass the parsed object, not the string, to the fetcher.
The SSRF pattern: validate here, fetch there
    # Pseudocode seen in real codebases. Both halves are "correct" in isolation.
    from urllib.parse import urlparse
    import requests

    ALLOWED = {"images.example.com", "cdn.example.com"}

    def fetch_user_supplied_url(url: str) -> bytes:
        # Validator: urllib.parse
        host = urlparse(url).hostname
        if host not in ALLOWED:
            raise PermissionError(f"host {host!r} not in allowlist")

        # Fetcher: requests (which uses urllib3, which re-parses the URL).
        return requests.get(url, timeout=5).content

    # Attacker input:
    #   http://images.example.com\@internal.corp/secret
    #
    # urlparse says hostname = "internal.corp"         -> BLOCKED (safe!)
    # OR urlparse says hostname = "images.example.com" -> ALLOWED
    # requests resolves hostname = "internal.corp"     -> FETCHES internal host
    #
    # Different stdlib versions give different answers for the same input.
    # You cannot rely on "my validator said no" if the fetcher re-parses.

The robust fix is not to pick a better URL library. It is to change the interface: validate against the parsed object, then pass the parsed object (or an IP address resolved against an allowlist) to the fetcher, so no second parse ever happens. If the fetcher insists on a string, serialize the parsed object back to a string yourself, so the bytes the fetcher sees are ones you produced from a trusted representation, not ones the attacker supplied.
- http://trusted.example\@evil.example/: backslash in userinfo (parsers split differently).
- http://trusted.example#@evil.example/: fragment confusion (some parsers treat # as userinfo boundary).
- http://trusted.example:80@evil.example/: port-in-userinfo (the @ sign steals the hostname).
- http://evil.example/?@trusted.example/: query string masquerading as host (naive regex validators).
- http://[::ffff:127.0.0.1]/: IPv4-mapped IPv6 (allowlist checks on hostname string miss the loopback).
- http://2130706433/: decimal-encoded 127.0.0.1 (host allowlist string match misses).
- http://xn--exampl-gva.com/: IDN/punycode (validator sees ASCII, browser/resolver sees Unicode).
- http://trusted.example.EVIL.example/: a prefix check using startsWith matches an attacker-owned parent domain (the endsWith analogue, eviltrusted.example, forgets the dot).
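Two of the payload families above (alternate IP spellings and sloppy suffix checks) can be handled on the parsed host with the standard library. The helpers below are a hypothetical sketch: they cover decimal and IPv4-mapped forms, but not the octal or hex dotted forms that some resolvers also accept.

```python
import ipaddress


def canonical_ip(host: str):
    """Return the IP address a host string denotes, or None for a DNS name."""
    host = host.strip("[]")  # IPv6 literals arrive bracketed in URLs
    try:
        if host.isascii() and host.isdigit():
            ip = ipaddress.ip_address(int(host))  # "2130706433" is 127.0.0.1
        else:
            ip = ipaddress.ip_address(host)
    except ValueError:
        return None
    # Unwrap ::ffff:a.b.c.d so the IPv4 rules apply to the embedded address.
    return getattr(ip, "ipv4_mapped", None) or ip


def is_internal(host: str) -> bool:
    ip = canonical_ip(host)
    return ip is not None and (ip.is_private or ip.is_loopback or ip.is_link_local)


def host_allowed(host: str, allowed_suffix: str) -> bool:
    """Suffix match that requires a label boundary (the leading dot)."""
    host = host.lower().rstrip(".")
    return host == allowed_suffix or host.endswith("." + allowed_suffix)


print(is_internal("[::ffff:127.0.0.1]"))                                # True
print(is_internal("2130706433"))                                        # True
print(host_allowed("trusted.example.evil.example", "trusted.example"))  # False
```

These checks belong after the single parse and before the fetch; running them on the raw URL string would reintroduce the problem they are meant to solve.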
Orange Tsai, "A New Era of SSRF" (BlackHat USA 2017)
The definitive talk. It demonstrates that URL parsers across major languages and libraries disagree with each other on crafted inputs, and that the attacker only needs the gap between the specific validator and the specific fetcher. The take-away is not "use library X," it is "do not reparse attacker-controlled URLs across a trust boundary."
Robust pattern: resolve once, pass structured data
    from urllib.parse import urlparse, urlunparse
    import ipaddress
    import socket

    ALLOWED_HOSTS = {"images.example.com", "cdn.example.com"}

    def fetch_with_pinned_ip(url: str, ip) -> bytes:
        # Placeholder, not a real library call: it should connect to `ip`
        # while sending the URL's hostname as Host / SNI.
        raise NotImplementedError

    def safe_fetch(url: str) -> bytes:
        # 1. Parse ONCE. Everything after this point operates on the parsed object.
        parsed = urlparse(url)

        # 2. Reject schemes, userinfo, and anything exotic up front.
        if parsed.scheme not in ("http", "https"):
            raise ValueError("scheme not allowed")
        if parsed.username or parsed.password:
            raise ValueError("userinfo not allowed")
        if not parsed.hostname:
            raise ValueError("no hostname")

        # 3. Allowlist on the parsed hostname (case-fold, no suffix tricks).
        host = parsed.hostname.lower()
        if host not in ALLOWED_HOSTS:
            raise PermissionError(f"host {host!r} not allowed")

        # 4. Resolve to an IP here (gethostbyname is IPv4-only), and block
        #    RFC 1918 / loopback / link-local.
        ip = ipaddress.ip_address(socket.gethostbyname(host))
        if ip.is_private or ip.is_loopback or ip.is_link_local or ip.is_reserved:
            raise PermissionError("internal IP")

        # 5. Re-serialize from the parsed object, so the fetcher cannot re-parse
        #    attacker bytes. Pin the hostname to the resolved IP via a custom
        #    transport if the library supports it.
        #    parsed.port raises ValueError on a malformed port; keep a valid one.
        netloc = host if parsed.port is None else f"{host}:{parsed.port}"
        safe_url = urlunparse((parsed.scheme, netloc, parsed.path or "/", "", parsed.query, ""))
        return fetch_with_pinned_ip(safe_url, ip)

A reviewer sees `if url.startswith('https://api.trusted.com')` gating an outbound fetch. Why is this insufficient even if the URL is later parsed by a library?
Path Normalization
Every web stack normalises request paths before authorization, routing, or filesystem access. Every web stack normalises differently. When two components normalise a path using different rules, and one of them uses the result for access control while the other uses it for action, you get authorization bypass (reach an admin endpoint the WAF thought was /public), path traversal (read /etc/passwd through a filter that never saw a ..), or static-asset leakage (serve source files that the framework thought it was routing to a handler).
Path Normalization Quirks by Stack
| Input | Interpreted as | Stack |
|---|---|---|
| <code>/public/..;/admin/</code> | <code>/admin/</code> | Tomcat: <code>;</code> starts a path parameter, which is stripped before <code>..</code> is resolved, while a front-end proxy treats <code>..;</code> as an ordinary segment. |
| <code>/public/..%2f/admin/</code> | <code>/admin/</code> after decode | Many frameworks — percent-decode happens after the proxy's access check. |
| <code>/admin/./</code> | <code>/admin/</code> | All stacks — but some allowlists compare the raw bytes before collapse. |
| <code>/admin//</code> | <code>/admin/</code> | Most stacks collapse double slashes; a WAF that does not sees a "different" URL. |
| <code>/AdMiN/</code> | <code>/admin/</code> | Case-insensitive on Windows/IIS, case-sensitive on Linux/nginx. Reverse proxies can disagree with origin. |
| <code>/admin/index.php/</code> | <code>/admin/index.php</code> | PHP path-info handling — the trailing segment is consumed by the script, not the router. |
| <code>/static/..%252f/etc/passwd</code> | <code>/etc/passwd</code> after double-decode | Any stack that decodes percent-escapes twice (once in the router, once in a downstream layer). |
| <code>/%2e%2e/admin/</code> | <code>/../admin/</code> after decode | WAF blocks <code>../</code> but not <code>%2e%2e</code>; origin decodes before resolving. |
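Most rows in the table disappear if a single component canonicalises the path once and every later decision uses that result. The function below is a deliberately strict sketch under stated assumptions: it rejects any path that still contains a percent sign after one decode (which also refuses a legitimately encoded literal %), rejects path parameters and backslashes, drops trailing slashes, and leaves case untouched. `canonical_path` is a hypothetical helper name.

```python
import posixpath
from urllib.parse import unquote


def canonical_path(raw: str) -> str:
    """Decode and normalise a request path once; reject ambiguous input."""
    if not raw.startswith("/"):
        raise ValueError("path must be absolute")
    decoded = unquote(raw)
    if "%" in decoded:
        # A percent sign survived one decode, so a later layer could decode again.
        raise ValueError("possible double encoding")
    if any(ch in decoded for ch in (";", "\\", "\0")):
        raise ValueError("path parameters, backslashes and NUL are rejected")
    # normpath collapses ".", "..", and repeated slashes; it preserves a
    # leading "//", so strip and re-add a single leading slash.
    return "/" + posixpath.normpath(decoded).lstrip("/")


print(canonical_path("/public/..%2f/admin/"))  # /admin
print(canonical_path("/%2e%2e/admin/"))        # /admin
```

The authorization check and the router should both receive the returned string; if the origin must see the raw path, it should at least be rejected whenever it differs from the canonical one.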
The pattern in code
    # App server (Tomcat in this example) behind nginx. nginx is configured to
    # require auth on /admin.
    # The app trusts nginx for authn and just serves handlers based on path.

    # nginx.conf
    #   location /admin { auth_request /auth; proxy_pass http://upstream; }

    # /public/..;/admin/ -> nginx sees a path under /public/ (it treats "..;" as
    #   an ordinary segment), so the /admin location and its auth_request are skipped.
    # Tomcat (or any app that strips ;params) sees /public/../admin/ -> /admin/.
    # Attacker reaches /admin/ with no auth_request in front.

    # Fix:
    #   1. Normalise once, at the edge, before ANY auth decision.
    #   2. Use exact-path allowlists, not prefix allowlists.
    #   3. Reject requests whose normalised path differs from the raw path unless
    #      you have an explicit reason to allow rewriting.

Never authorize on the raw string, never execute on the normalised one
The inverse pattern is just as dangerous and much more common: a server normalises the path for routing (so /admin/../../public/x routes to /public/x), then hands the original string to a filesystem call that re-resolves the .. segments itself. The check sees a safe path, the syscall can end up at /etc/passwd. Always check on the same representation you operate on.
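As a minimal Python sketch of "check on the same representation you operate on": resolve the path once, test containment on the resolved value, and open exactly that value. `resolve_inside` and `serve_file` are hypothetical names, `SAFE_ROOT` is an example directory, and the sketch does not address races where the filesystem changes between the check and the read.

```python
from pathlib import Path

SAFE_ROOT = Path("/var/app/uploads")  # example root directory


def resolve_inside(root: Path, user_path: str) -> Path:
    """Return the resolved path, or raise if it escapes root."""
    root = root.resolve()
    # resolve() collapses ".." and follows symlinks, so the containment
    # check below sees the location the OS would actually open.
    candidate = (root / user_path).resolve()
    if not candidate.is_relative_to(root):  # Python 3.9+
        raise PermissionError("path traversal")
    return candidate


def serve_file(user_path: str) -> bytes:
    # The path that passed the check is the path that gets opened.
    return resolve_inside(SAFE_ROOT, user_path).read_bytes()
```

Using `is_relative_to` compares whole path components, so a sibling such as `/var/app/uploads-evil` does not pass as it would with a bare string-prefix test.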
Safe path resolution in Node
    import path from 'node:path';
    import fs from 'node:fs/promises';

    const SAFE_ROOT = '/var/app/uploads';

    export async function serveFile(userPath) {
      // 1. Resolve to an absolute path (collapses ..).
      const absolute = path.resolve(SAFE_ROOT, userPath);

      // 2. Assert the result is STILL within the safe root.
      //    path.resolve will happily return /etc/passwd if userPath is "../../etc/passwd".
      //    Trailing separator matters to avoid /var/app/uploads-evil passing the prefix check.
      if (!absolute.startsWith(SAFE_ROOT + path.sep) && absolute !== SAFE_ROOT) {
        throw new Error('path traversal');
      }

      // 3. Use the resolved path for BOTH the check and the operation.
      //    Do not re-open based on the original userPath.
      return await fs.readFile(absolute);
    }

A code-review diff adds a WAF rule: `location ~* \.\./ { return 403; }`. The team believes this blocks path traversal. What is the problem?
JSON Parser Quirks
JSON is often described as "a simple data format." It is not. RFC 8259 leaves significant behaviour to implementations (duplicate keys, number range and precision, unpaired surrogates), and major parsers make different choices. When JSON crosses a trust boundary (signed in one service, verified in another, or rewritten by a proxy and consumed by an app), those choices become parser differentials.
JSON Behaviours That Differ Across Parsers
| Behaviour | Typical outcome | Where it hurts |
|---|---|---|
| Duplicate keys | JavaScript / Python json / Go encoding/json: last wins. Jackson: configurable (default last). Some Go parsers error. JWT libraries: varies. | Signed-by-first-parser / consumed-by-last-parser — attacker hides a second value after the signed one. |
| Numeric precision | JavaScript: IEEE-754 double only — 9007199254740993 becomes 9007199254740992. Python / Java BigInteger / Go json.Number: exact. | Monetary amounts, user IDs, nonces that cross JS and non-JS services silently lose precision. |
| Trailing bytes after root value | JavaScript JSON.parse: error. Python json.loads: error. Go encoding/json: json.Unmarshal errors, but a json.Decoder reads one value and leaves the rest unread. Some XML→JSON bridges: concatenate. | Signature covers bytes 0..N; consumer also reads bytes N..end. Smuggling via trailer. |
| Comments / trailing commas | Strict parsers: error. JSON5, JSONC, some YAML parsers: allowed. "Permissive" JSON libs: allowed. | Serialize-then-parse pipelines can lose or gain fields between the "strict" verifier and the "permissive" consumer. |
| Unicode escapes | Most parsers decode <code>\u0000</code> to literal NUL. Some databases truncate at NUL. Some validators check the escaped string, app reads the decoded one. | Bypass string-length checks and null-byte injection into backend stores. |
| __proto__ and constructor keys | Node's JSON.parse produces objects with an own <code>__proto__</code> property. A naive recursive merge walks into it and pollutes Object.prototype; <code>Object.assign</code> instead swaps the prototype of its target object. | Prototype pollution via JSON input. |
| BOM / leading whitespace | JSON.parse rejects BOM. Some Java parsers ignore. Some Go parsers error. | Signature verifier strips BOM, consumer does not (or vice versa); canonicalisation breaks. |
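Several rows of the table can be turned into hard errors at the boundary instead of silent choices. The sketch below wraps Python's standard `json.loads` with hooks it already provides; `strict_loads` and its helpers are hypothetical names, and the integer limit is an assumption that the data will also be read by an IEEE-754-double consumer such as JavaScript.

```python
import json

MAX_SAFE_INT = 2**53 - 1  # same value as JavaScript's Number.MAX_SAFE_INTEGER


def _no_duplicates(pairs):
    """Build a dict from (key, value) pairs, refusing repeated keys."""
    obj = {}
    for key, value in pairs:
        if key in obj:
            raise ValueError(f"duplicate key: {key!r}")
        obj[key] = value
    return obj


def _safe_int(text: str) -> int:
    """Parse an integer literal, refusing values a double cannot hold exactly."""
    number = int(text)
    if abs(number) > MAX_SAFE_INT:
        raise ValueError(f"integer outside the double-safe range: {text}")
    return number


def strict_loads(data: str):
    """json.loads, but duplicate keys and oversized integers are errors."""
    return json.loads(data, object_pairs_hook=_no_duplicates, parse_int=_safe_int)


print(strict_loads('{"user": "alice", "amount": 100}'))
```

Inputs such as `{"role":"user","role":"admin"}` or `{"order_id": 9007199254740993}` raise `ValueError` here, and trailing bytes after the root value are already rejected by `json.loads` itself.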
Duplicate-key smuggling in a signed-JSON pipeline
    // Simplified sketch of a real CVE class.
    // Service A signs a JSON blob with an HMAC over the exact bytes.
    // Service B verifies the HMAC, then json-parses and uses the object.
    import { createHmac } from 'node:crypto';

    const KEY = 'demo-key'; // placeholder secret for this sketch
    const hmac = (data) => createHmac('sha256', KEY).update(data).digest('hex');

    // Attacker-controlled JSON:
    const body = '{"user":"alice","role":"user","role":"admin"}';

    // A signs the bytes:
    const sig = hmac(body); // valid signature over the literal bytes above.

    // B verifies the HMAC -> passes.
    // B then json-parses.
    const obj = JSON.parse(body); // { user: "alice", role: "admin" } (last wins)

    // Depending on WHICH duplicate key the signer validated against,
    // this is an admin-privilege smuggling bug.
    //
    // The fix is not "tell people not to do this."
    // The fix is: canonicalize before signing and before verifying. JCS (RFC 8785)
    // produces a deterministic byte-exact serialization, and its input rules
    // (I-JSON) disallow duplicate keys. Sign and verify the canonical form,
    // not the wire form.

Numeric precision across JS and Python
    // A microservice receives an order ID from a payments service:
    //   {"order_id": 9007199254740993, "amount": 100}

    // The payments service is written in Python, so the order ID is exact.
    // The JS microservice parses the body:
    const body = JSON.parse('{"order_id": 9007199254740993}');
    console.log(body.order_id); // 9007199254740992  <-- silently rounded

    // Now the JS service fetches order 9007199254740992, which belongs to
    // a different user. Same JSON bytes, different numeric interpretation.

    // Fix options:
    //   1. Send large IDs as strings. Always.
    //   2. Use a JSON parser with BigInt support (e.g. json-bigint) at the
    //      boundary that receives them, and handle BigInt throughout.
    //   3. Validate that the parsed value round-trips: JSON.stringify(parsed) === originalBytes.

Prototype pollution via __proto__
A POST body of {"__proto__":{"isAdmin":true}} passed through a naive recursive merge (a hand-rolled deep merge, or older versions of some utility libraries) mutates Object.prototype, and every object in the process now has isAdmin === true. Object.assign({}, req.body) is narrower but still unsafe: it replaces the prototype of the new object, so that object reports isAdmin === true. This is a JSON parser differential because JSON.parse sets __proto__ as an own-property key rather than mutating the prototype, but any subsequent merge or clone operation treats it as a magic property. Fix: refuse __proto__, constructor, and prototype keys in input, or use Object.create(null) for bags, or use a validated schema.
A team signs JWTs with HMAC-SHA256 and verifies them server-side. Someone reports that sending a header like `{"alg":"HS256","alg":"none"}` bypasses signature verification in the library they use. What is the root cause?
Code-Review Checklist
Parser differentials do not appear in any one file. They appear in the seam between two files, two services, or two machines. The review discipline is therefore structural: look for places where the same bytes are parsed more than once in the trust chain, and treat every such place as a bug until proven otherwise.
- The same user-supplied string is parsed by two different libraries, or by the same library at two different trust layers. (Validator+fetcher, WAF+origin, signer+verifier, router+handler.)
- An authorization check runs against a string (URL, path, identifier, header value) and the subsequent operation runs against a re-parsed version. Bugs hide in the second parse.
- Prefix or suffix checks on URLs, hostnames, or paths: startsWith, endsWith, contains, regex-match against a raw string. Almost always a parser differential waiting to happen.
- A signature verification step and a business-logic step that both query the same signed document. If the queries differ (XPath vs getElementById, top-level-key vs nested-key), the signature does not bind what the business logic reads.
- Any code that handles Content-Length, Transfer-Encoding, or HTTP/2 pseudo-headers manually instead of delegating to a vetted HTTP library.
- JSON that is signed, verified, and then re-parsed. Unless the signature is over the canonical form, duplicate keys and numeric precision are live bugs.
- Path handling that uses string concatenation (root + user_input) and checks containment with startsWith(root) without a trailing separator: the classic sibling-directory bypass.
- A WAF rule written as a regex against raw request bytes. WAF regex vs origin parser is the textbook differential.
- Any place where a single byte stream gives rise to two different names for "the same thing" — URL vs hostname, path vs canonical path, header name case, Unicode normalisation forms.
Greppable smells
    # A URL string is checked with startsWith/contains and later passed to a fetcher.
    rg -n 'startsWith\(\s*["\x27]https?://' --type js --type ts
    rg -n 'url\.startswith\(\s*["\x27]https?://' --type py

    # Path handling without canonicalisation.
    rg -n 'join\([^)]*user|req\.|params\.' | rg -v 'resolve|normalize'
    rg -n 'os\.path\.join.*request\.|open\(.*request\.' --type py

    # Signed-then-reparsed JSON.
    rg -n 'verify.*JSON\.parse|jwt\.decode.*JSON\.parse' --type js --type ts

    # Manual HTTP framing (a smell on its own).
    rg -n 'Content-Length|Transfer-Encoding' --type go --type java --type py | rg -v '_test|mock'

    # WAF regex that tries to catch traversal.
    rg -n '\\\.\\\./|%2e%2e' nginx.conf modsec/

The one-question review
"Does the component that makes the security decision operate on the same representation as the component that performs the action?" If the answer is no, you have a parser differential. This one question catches almost every real-world bug in this class.