Blending combines colours from a source (what you draw) and a destination (what’s already in the framebuffer). Modern GPUs provide a simple, clamped linear blend model that maps easily to fixed-function hardware. The PlayStation 2 (PS2) used a different blending formula whose weights and intermediate values can fall outside the [0,1] range, something fixed-function PC GPUs can’t natively reproduce. To match the PS2 result, we emulate the PS2 blend in fragment shaders, but that brings ordering, caching, and performance complications. This article explains each of these in turn, then adds a section on testing, debugging, and practical implementation patterns.
- PS2-style blending — the different equations and where trouble starts
- Emulation in fragment shaders — what we do and why
- Destination reads, texture cache, and coherence
- Performance trade-offs and optimisations
- Practical GLSL-style example (conceptual)
- Unique section — testing, debugging & correctness checklist (new)
- Final notes and practical recommendations
Blending on modern GPUs — the familiar, clamped model
Modern desktop/mobile GPUs expose a linear blend expressed typically as:
result = k1 * color1 ± k2 * color2
- where color1 and color2 are the source/destination colors and k1, k2 are factors (alpha, 1-alpha, or constants).
- Those coefficients are clamped to [0,1] by the fixed-function blending unit so the result stays in the normalized color range.
This clamping simplifies hardware design and avoids overflow: everything remains within legal color ranges for the render target (no intermediate >1 values).
- Benefits of the fixed-function path: minimal overhead, guaranteed ordering semantics for blending (the hardware enforces correct in-order accumulation of fragments), and highly optimized memory/cache behavior.
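As a rough illustration of the clamped model, here is a minimal Python sketch for a single colour channel. The names `clamp01` and `fixed_function_blend` and the float-per-channel representation are assumptions made for this example; they do not correspond to any real graphics API.

```python
def clamp01(x):
    """Clamp a value to the normalised range [0, 1]."""
    return max(0.0, min(1.0, x))


def fixed_function_blend(src, dst, k_src, k_dst, subtract=False):
    """Blend one channel the way a clamped fixed-function unit would.

    Hypothetical helper: factors are clamped to [0, 1] before use and
    the result is clamped again, as on a normalised render target.
    """
    k_src = clamp01(k_src)
    k_dst = clamp01(k_dst)
    if subtract:
        result = k_src * src - k_dst * dst
    else:
        result = k_src * src + k_dst * dst
    return clamp01(result)


# Classic alpha blend: src * alpha + dst * (1 - alpha)
alpha = 0.25
print(fixed_function_blend(0.8, 0.4, alpha, 1.0 - alpha))  # approximately 0.5

# A factor of 1.5 cannot be expressed: it is clamped to 1.0 first
print(fixed_function_blend(0.4, 0.0, 1.5, 0.0))  # 0.4
```

The second call is the interesting one for what follows: a weight above 1 is simply not available in this model.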

PS2-style blending — the different equations and where trouble starts
The PS2 (and a few other fixed-function designs) use a different canonical form for blending:
result = (Color1 - Color2) * Coefficient + Color3
- Here, Color1, Color2, and Color3 can each be either the source colour, the destination color, or zero; Coefficient is typically an alpha or a constant.
This formulation is algebraically equivalent to some linear combinations but with a key practical difference: intermediate coefficients can make some terms exceed 1. Example rearrangement when Color3 == Color1:
result = Color1 * (1 + Coefficient) - Color2 * Coefficient
- If Coefficient > 0, the Color1 * (1 + Coefficient) term may exceed 1. On PS2 hardware, that internal representation and clamping strategy were handled in the fixed-function pipeline, but on PC GPUs the fixed-function blend expects coefficients producing final values inside [0,1].
- Another subtlety: the PS2 coefficient sometimes effectively ranges up to ~2 (implementation details vary). When coefficients can exceed 1, a typical PC fixed-function blend cannot represent the PS2 result without extra computation.
- Net result: some PS2 blends are not representable on PC fixed-function blend units without extra work — hence “impossible” blends.
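A small worked example may help make the problem concrete. The Python sketch below evaluates the PS2 form and its rearrangement for one channel, then shows what happens if the weights are clamped to [0,1] as a fixed-function unit would do. The function names and the sample values are assumptions chosen for illustration.

```python
def ps2_blend(color1, color2, color3, coeff):
    """Evaluate (Color1 - Color2) * Coefficient + Color3 for one channel.

    No clamping is applied, so intermediate values stay visible.
    """
    return (color1 - color2) * coeff + color3


def ps2_blend_rearranged(color1, color2, coeff):
    """Same blend with Color3 == Color1, written as two weighted terms."""
    return color1 * (1.0 + coeff) - color2 * coeff


def clamped_weights(color1, color2, coeff):
    """The rearranged form with both weights clamped to at most 1.0."""
    return color1 * min(1.0, 1.0 + coeff) - color2 * min(1.0, coeff)


src, dst, coeff = 0.75, 0.5, 0.5
direct = ps2_blend(src, dst, src, coeff)
rearranged = ps2_blend_rearranged(src, dst, coeff)
clamped = clamped_weights(src, dst, coeff)
print(direct, rearranged, clamped)  # 0.875 0.875 0.5
```

Both PS2 forms agree on 0.875, but the first weight is 1.5 and the term `src * 1.5` is 1.125. Clamping that weight to 1.0 yields 0.5 instead, which is the mismatch the text describes.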
Emulation in fragment shaders — what we do and why
- Why use fragment shaders? Fragment shaders are programmable and can compute arbitrary expressions per-pixel; they can reproduce (Color1 - Color2) * coeff + Color3 exactly (within floating precision). They let us avoid the fixed-function blend unit’s restrictions.
- How it works conceptually:
  - Sample the destination color (either via a texture read of the previous contents or via an extension that provides framebuffer fetch).
  - Compute the PS2-formula in the shader with floating-point math.
  - Output the computed colour directly to the render target (with blending disabled) so the fragment shader result becomes the new pixel value.
- Ordering problem: Fragment shaders are executed out-of-order and in parallel for performance (tiles/wavefronts). Blending is inherently order-dependent: later fragments must blend over earlier ones in a deterministic in-order stream. If two draw calls produce overlapping fragments, and the GPU computes them out-of-order, the emulated blend would be incorrect.
- Workaround for ordering: As long as draw primitives do not overlap each other in the framebuffer, order doesn’t matter — only one fragment contributes per pixel. So a practical approach is to split draws into non-overlapping batches: each batch is drawn with the emulating shader and guarantees no overlap among its primitives. This enforces correctness at the cost of additional draw call overhead.
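The batching workaround can be sketched on the CPU side as follows. This is a minimal, conservative version that assumes each primitive is described by a screen-space bounding box `(x0, y0, x1, y1)` with exclusive upper bounds; the function names and the box representation are hypothetical, and a real renderer may use a finer overlap test.

```python
def boxes_overlap(a, b):
    """Return True if two (x0, y0, x1, y1) boxes share any pixel.

    Boxes are half-open: x1 and y1 are exclusive.
    """
    return a[0] < b[2] and b[0] < a[2] and a[1] < b[3] and b[1] < a[3]


def split_into_batches(boxes):
    """Group primitives, in submission order, into non-overlapping batches.

    A new batch starts as soon as a primitive overlaps anything already
    in the current batch, so draw order is preserved across batches.
    """
    batches = []
    current = []
    for index, box in enumerate(boxes):
        if any(boxes_overlap(box, boxes[other]) for other in current):
            batches.append(current)
            current = []
        current.append(index)
    if current:
        batches.append(current)
    return batches


prims = [
    (0, 0, 10, 10),
    (10, 0, 20, 10),   # touches the first box but shares no pixel
    (5, 5, 15, 15),    # overlaps both previous boxes
    (30, 30, 40, 40),
]
print(split_into_batches(prims))  # [[0, 1], [2, 3]]
```

Bounding boxes over-estimate overlap for triangles, so this sketch may split more often than strictly necessary; it never merges primitives that do overlap.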

Destination reads, texture cache, and coherence
- Why reading destination is hard: GPUs optimise reads with a texture cache that assumes render-target content won’t change in-flight. The cache is typically read-only from the viewpoint of fragment shaders to avoid complex coherency machinery.
- Consequences: If you write to the framebuffer and then, in the same pass, read back from it (e.g., the next fragment expects to sample the just-written pixel), the texture cache may return stale data — because writes don’t automatically invalidate read caches. That breaks an emulation that depends on sampling the current destination value.
- Practical solutions:
  - Ping-pong rendering: Render into an off-screen texture (a render target) while sampling from a previous texture that holds the prior frame or prior pass results. After the pass completes, swap the textures. This avoids in-flight read-after-write hazards.
  - Invalidate/flush mechanisms: Modern drivers and APIs provide ways to invalidate caches or to signal coherency, for example glTextureBarrier in OpenGL 4.5 (ARB_texture_barrier); names vary by platform. Use these primitives carefully, as they can have a performance cost.
  - Framebuffer-fetch extensions / Image load-store: On some platforms, there are extensions (such as GL_EXT_shader_framebuffer_fetch) or image load/store APIs that allow directly reading the current framebuffer value or atomic updates. These can require explicit memory barriers and have portability/perf implications.
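The ping-pong idea can be modelled without any graphics API. The sketch below uses two plain Python lists as stand-ins for the two render targets; the class `PingPongTargets` and the helper `run_pass` are hypothetical names, and each pass is assumed to cover every pixel.

```python
class PingPongTargets:
    """Two same-sized buffers: one is sampled, the other is written."""

    def __init__(self, width, height, clear=0.0):
        size = width * height
        self.buffers = [[clear] * size, [clear] * size]
        self.read_index = 0

    @property
    def read(self):
        return self.buffers[self.read_index]

    @property
    def write(self):
        return self.buffers[1 - self.read_index]

    def swap(self):
        """Make the freshly written buffer the one sampled next pass."""
        self.read_index = 1 - self.read_index


def run_pass(targets, shade):
    """Run one full-screen pass; shade(dest) returns the new pixel value."""
    source_buffer = targets.read
    target_buffer = targets.write
    for i, dest_value in enumerate(source_buffer):
        target_buffer[i] = shade(dest_value)
    targets.swap()


targets = PingPongTargets(2, 2, clear=0.25)
# Each pass computes (source - dest) * coeff + dest with source=1.0, coeff=0.5
run_pass(targets, lambda dest: (1.0 - dest) * 0.5 + dest)
run_pass(targets, lambda dest: (1.0 - dest) * 0.5 + dest)
print(targets.read)  # [0.8125, 0.8125, 0.8125, 0.8125]
```

Because a pass never reads the buffer it writes, there is no read-after-write hazard within a pass. In a real renderer, pixels not covered by the pass would also need to be carried over to the new target, which this sketch sidesteps by shading every pixel.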
Performance trade-offs and optimisations
- Draw call splitting cost: Splitting a big draw into multiple non-overlapping draws increases CPU overhead and can hurt throughput. Only do this when correctness requires PS2-accurate results.
- Batching strategies: When possible, minimise splits by grouping primitives that are guaranteed non-overlapping (spatial partitioning, binning, or coarse scissor rectangles).
- Depth/stencil tricks: Use depth pre-pass or early-z to discard fragments that won’t affect the final image, reducing fragment shader invocations. If primitives are opaque in some regions, leverage depth to avoid expensive shader runs.
- Choose the right destination-access method:
  - Ping-pong textures are simple, robust and portable, but cost an extra texture and a resolve/swap.
  - Framebuffer fetch / image load-store can reduce memory traffic when available, but require careful barrier use and aren’t universally supported.
- Precision considerations: Use the GPU’s native float precision (usually 32-bit) for accurate emulation. Lower precision formats (16-bit) can introduce visible differences versus the PS2 reference.
Practical GLSL-style example (conceptual)
- This pseudocode shows a straightforward fragment shader approach assuming you have bound the previous frame or previous pass render target as uDestTex. It’s intentionally API-agnostic — exact binding and sampling depends on your engine and GL/Vulkan/DX version.
// Fragment shader (conceptual)
uniform sampler2D uDestTex; // destination contents from previous pass
in vec2 vUV;                // screen-space UV addressing this fragment's pixel in uDestTex
in vec4 vSrcColor;          // source color for this fragment
out vec4 FragColor;

void main() {
    vec4 dest = texture(uDestTex, vUV); // read destination
    // Example coefficient: source alpha. Substitute destination alpha or a
    // constant, as the emulated blend mode requires.
    float coeff = vSrcColor.a;
    // PS2-style: result = (Color1 - Color2) * coeff + Color3
    // choose Color1/Color2/Color3 per the PS2 blend mode
    vec4 color1 = vSrcColor; // example
    vec4 color2 = dest;      // example
    vec4 color3 = vec4(0.0); // example
    vec4 result = (color1 - color2) * coeff + color3;
    FragColor = result;
}
- Notes: In practice, you will:
  - Compute the right color1/color2/color3 based on the emulated PS2 blend configuration.
  - Use a ping-pong render target or an earlier resolved texture to ensure the texture() read returns the intended destination value.
  - Disable fixed-function blending for this draw (you’re writing the final color directly).
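Choosing color1/color2/color3 from a blend configuration can be prototyped on the CPU before it goes into a shader. The Python sketch below is one possible shape for that logic, for a single channel; the selector constants and the `(color1, color2, color3)` triple are hypothetical and do not reflect any particular emulator's data structures.

```python
SOURCE, DEST, ZERO = "source", "dest", "zero"


def pick(selector, src, dst):
    """Resolve an operand selector to a channel value."""
    if selector == SOURCE:
        return src
    if selector == DEST:
        return dst
    if selector == ZERO:
        return 0.0
    raise ValueError(f"unknown selector: {selector!r}")


def emulate_blend(mode, src, dst, coeff):
    """Evaluate (Color1 - Color2) * coeff + Color3 for one channel.

    mode is a hypothetical (color1, color2, color3) selector triple.
    """
    c1, c2, c3 = (pick(s, src, dst) for s in mode)
    return (c1 - c2) * coeff + c3


# (source - dest) * coeff + dest: the usual alpha blend
print(emulate_blend((SOURCE, DEST, DEST), 0.75, 0.25, 0.5))  # 0.5

# (source - 0) * coeff + dest: additive blend
print(emulate_blend((SOURCE, ZERO, DEST), 0.5, 0.25, 0.5))   # 0.5
```

A reference like this is also useful as the trusted side of a comparison when checking shader output.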
Unique section — testing, debugging & correctness checklist (new)
- Visual difference map: Render both the original PS2 reference (or a trusted software reference) and your emulated output to two textures, then render a debug view that shows per-pixel absolute differences (e.g., abs(ref - emu)). This quickly highlights where emulation breaks and whether the error is small (rounding) or large (ordering/cache issue).
- Ordering test suite: Create automated tests with controlled primitives:
  - Non-overlapping primitives (should match exactly).
  - Overlapping primitives drawn in known order (verify you either split draws correctly or accept a fallback approximation).
  - Randomized alpha and coefficient ranges (to stress coefficient clamping / overflow conditions).
- Performance regression harness: Measure the number of draw calls, fragment shader invocations, and GPU memory traffic before and after enabling emulation. Track frame time across scenes with varying overlap density to learn trade-offs.
- Shader fallback visualization: If performance is unacceptable in a scene, renderers should provide a “quality vs speed” fallback:
  - High-quality: full PS2-accurate shader emulation with draw splitting.
  - Fast: approximate blend implemented with fixed-function blending or a simplified shader (clamping coefficients to [0,1]). Visualize where the fallback is used to help tune scene content.
- Automation & tooling tips: Integrate the tests into your CI so changes to batching, barrier placement, or shader code don’t silently regress accuracy. Use GPU profilers (vendor tools) to inspect cache misses and memory stalls caused by destination sampling patterns.
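The visual difference map from the checklist above can be prototyped offline in a few lines. This sketch works on flat lists of single-channel values and sorts each pixel into one of three buckets; the function names and the default tolerance of one 8-bit step (1/255) are assumptions for illustration, not a recommended threshold.

```python
def difference_map(reference, emulated):
    """Per-pixel absolute difference of two equally sized channel lists."""
    if len(reference) != len(emulated):
        raise ValueError("images must have the same number of pixels")
    return [abs(r - e) for r, e in zip(reference, emulated)]


def classify(diff, rounding_tolerance=1.0 / 255.0):
    """Label each pixel 'match', 'rounding' or 'large' by error size."""
    labels = []
    for d in diff:
        if d == 0.0:
            labels.append("match")
        elif d <= rounding_tolerance:
            labels.append("rounding")
        else:
            labels.append("large")
    return labels


reference = [0.5, 0.25, 0.75, 1.0]
emulated = [0.5, 0.25 + 1.0 / 512.0, 0.25, 1.0]
diff = difference_map(reference, emulated)
print(classify(diff))  # ['match', 'rounding', 'large', 'match']
```

In an automated test, a count of 'large' pixels above zero for a non-overlapping scene would be a reasonable failure condition, since those scenes are expected to match.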
Final notes and practical recommendations
- Emulating exotic fixed-function blending in a fragment shader is perfectly viable and often the cleanest path when you require bit-exact or visually faithful behavior.
- Expect a performance cost — plan for profiling, selective application (only where needed), and fallbacks for older hardware.
- Use robust development tooling: visual diffs, deterministic test scenes, and a performance harness to make data-driven decisions.
