Thanks Marty, understood.
Marty McFly wrote:
And the output of each shader is then rounded to 8 bit per channel because that's the format of the backbuffer, applying any dither afterwards that only relies on values between 8 bit ones is senseless.
Indeed, CeeJay's dither shader is applied to the high-precision floating-point values *before* the final output stage (truncation to 8-bit). The way he's done it is pretty clever: the final 8-bit output only receives noise on pixels that were in between 8-bit values inside dither.h (method 2). I believe this is what keeps the visible noise so low, which in my testing is essential for it to stay acceptable on 6-bit/FRC monitors.
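To illustrate for anyone else reading, the shape of the idea is roughly the sketch below. This is only my own minimal HLSL sketch with a placeholder noise function, not CeeJay's actual dither.h code.

// Minimal sketch: noise is added to the high-precision float colour before the
// backbuffer rounds it to 8 bit. The noise never exceeds half an 8-bit step,
// so pixels already sitting exactly on an 8-bit level round back to where they
// were; only the in-between pixels flip between the two neighbouring levels.
float Hash(float2 uv)
{
    // Placeholder screen-space noise; the real shader uses its own pattern.
    return frac(sin(dot(uv, float2(12.9898, 78.233))) * 43758.5453);
}

float3 DitherBeforeQuantise(float3 color, float2 uv)
{
    const float quantStep = 1.0 / 255.0; // one 8-bit step in [0,1] colour space
    float noise = Hash(uv) - 0.5;        // roughly [-0.5, 0.5)
    return color + noise * quantStep;    // the backbuffer rounding does the rest
}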
Shame that we cannot read in a 16-bit LUT .png, because that's the only way to take full advantage of CeeJay's exact method. The method I am using currently is inferior: first apply random dither noise to the image (by increasing the dither_shift numerator in dither.h method 2; a value of ~3.75 seems to work well on my 8-bit monitor), then apply your 1D texture LUT shader afterwards. The gradient becomes free of banding artefacts, but the noise is much more visible than in CeeJay's original implementation, to the point where it doesn't look great on a 6-bit monitor. I have also adapted Martin's grain.h shader to do the same thing temporally, and the result is better (especially using CeeJay's subpixel trick of dithering red and blue separately - very interesting, and I have no idea why it works so well). A rough sketch of how I picture that variant is below.
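Again this is only a sketch with placeholder noise, not the actual grain.h or dither.h code, and the red/blue handling is just my reading of CeeJay's trick. Timer is assumed to be ReShade's built-in "timer" uniform source.

uniform float Timer < source = "timer"; >; // milliseconds since ReShade started

float Hash(float2 uv)
{
    return frac(sin(dot(uv, float2(12.9898, 78.233))) * 43758.5453);
}

float3 SubpixelTemporalDither(float3 color, float2 uv)
{
    const float quantStep = 1.0 / 255.0;

    // Re-seed every frame so the noise pattern changes over time.
    float2 seed = uv + frac(Timer * 0.001);

    // Independent noise for red and blue; green simply reuses the red value
    // here. This mirrors the "dither red and blue separately" idea as I
    // understand it - not necessarily how dither.h actually weights channels.
    float nR = Hash(seed) - 0.5;
    float nB = Hash(seed + 7.13) - 0.5;

    return color + float3(nR, nR, nB) * quantStep;
}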