anykey.ai
Technical Briefing

How Quantization Benchmark Works

Comparing float32, float16, and int8 in the browser

Quantization Benchmark runs one TensorFlow.js model at different numeric precisions. It measures average inference time, estimated weight memory footprint, and how much output confidence shifts relative to a float32 baseline.

Data Pathway

1) One model, three precision modes

The experiment initializes a single convolutional model architecture and reuses it for each benchmark pass. For float32, weights are used as-is. For float16 and int8, each weight tensor is quantized and then loaded into an identical model clone.

01
Per-mode benchmark loop
// Run each precision pass in a fixed order against the same base model.
for (const mode of ["float32", "float16", "int8"] as const) {
  const result = await runModeBenchmark(baseModel, inputTensor, mode);
  orderedResults.push(result);
}
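A minimal sketch of what a `runModeBenchmark`-style helper can do, with the TensorFlow.js model and input abstracted behind a plain `infer` callback (the callback-based signature, `ModeResult` shape, and warm-up/pass counts here are illustrative assumptions, not the benchmark's actual API):

```typescript
interface ModeResult {
  mode: string;
  avgMs: number;
}

// Hypothetical sketch: time repeated calls to an inference function
// and return the average latency for one precision mode.
async function runModeBenchmark(
  infer: () => Promise<void>,
  mode: string,
  passes = 20
): Promise<ModeResult> {
  // Warm-up pass so one-time kernel/shader compilation is not counted.
  await infer();
  const start = performance.now();
  for (let i = 0; i < passes; i++) {
    await infer();
  }
  const avgMs = (performance.now() - start) / passes;
  return { mode, avgMs };
}
```

Averaging over many passes after a warm-up smooths out scheduler jitter and excludes first-run compilation cost, which would otherwise dominate a single timing.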

2) Float16 simulation

Float16 keeps float32's sign/exponent/mantissa layout but with a 5-bit exponent and a 10-bit mantissa, so both dynamic range and precision shrink. The benchmark rounds each weight value to half-precision granularity, then stores it back as float32 for browser execution.

02
Float16 rounding
// sign and clamped (|value| limited to the float16 range) are computed earlier
const exponent = Math.floor(Math.log2(clamped));
const normalized = clamped / 2 ** exponent;        // in [1, 2)
const mantissa = Math.round((normalized - 1) * 1024) / 1024; // 10 mantissa bits
return sign * (1 + mantissa) * 2 ** exponent;

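The rounding logic can be wrapped into a self-contained helper to see its effect on concrete values (the `roundToFloat16` name and function form are illustrative, not the benchmark's actual code; subnormals and overflow to Infinity are ignored for brevity):

```typescript
// Sketch: round a float32 value to half-precision granularity.
function roundToFloat16(value: number): number {
  if (value === 0 || !Number.isFinite(value)) return value;
  const sign = Math.sign(value);
  const clamped = Math.min(Math.abs(value), 65504); // float16 max normal value
  const exponent = Math.floor(Math.log2(clamped));
  const normalized = clamped / 2 ** exponent;        // in [1, 2)
  const mantissa = Math.round((normalized - 1) * 1024) / 1024; // 10 bits
  return sign * (1 + mantissa) * 2 ** exponent;
}

roundToFloat16(0.5); // 0.5 — exactly representable, no error
roundToFloat16(0.1); // ~0.0999756 — 0.1 falls between half-precision steps
```

Values whose mantissa fits in 10 bits, such as 0.5, round-trip exactly; most others, such as 0.1, pick up a small rounding error that accumulates across millions of weights.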
3) Int8 affine quantization

Int8 maps continuous weight values into 256 discrete buckets. The benchmark uses symmetric quantization (an affine scheme with a zero-point of 0): it computes a scale from the maximum absolute weight, quantizes into [-128, 127], then dequantizes back to float values before inference.

03
Int8 quantize -> dequantize
// scale chosen so the largest |weight| maps to 127
const scale = maxAbs === 0 ? 1 : maxAbs / 127;
// round to the nearest integer step, clamped to the int8 range
const q = Math.max(-128, Math.min(127, Math.round(value / scale)));
const dequantized = q * scale; // back to float for browser inference

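Applied across a whole weight tensor, the round trip looks like this (the `quantizeDequantizeInt8` name and the plain-array API are assumptions for illustration; the benchmark operates on TensorFlow.js tensors):

```typescript
// Sketch: symmetric int8 quantize -> dequantize over one weight array.
function quantizeDequantizeInt8(weights: Float32Array): Float32Array {
  // One scale per tensor, derived from the largest absolute weight.
  const maxAbs = weights.reduce((m, w) => Math.max(m, Math.abs(w)), 0);
  const scale = maxAbs === 0 ? 1 : maxAbs / 127;
  const out = new Float32Array(weights.length);
  for (let i = 0; i < weights.length; i++) {
    const q = Math.max(-128, Math.min(127, Math.round(weights[i] / scale)));
    out[i] = q * scale; // dequantize back to float for inference
  }
  return out;
}
```

With a zero-point of 0, a weight of exactly 0 stays 0 after the round trip, and every reconstruction error is bounded by half a quantization step (scale / 2).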
4) Measuring speed, memory, and quality

Speed is measured as average milliseconds across repeated inference passes. Memory is estimated from parameter count and bytes per precision (4, 2, or 1). Quality is reported as top-1 agreement and confidence drift versus float32.

04
Metric calculations
const memoryBytes = parameterCount * bytesForMode(mode); // 4, 2, or 1 bytes per weight
const top1Agreement = sameTop1 ? 100 : 0; // 100 if the top label matches the float32 baseline
const confidenceDrift = Math.abs(topPredictionProb - baselineProb);
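To get a concrete sense of the memory estimate, here is the calculation for a hypothetical model (the 1.2M parameter count is an illustrative assumption, not measured from the benchmark's actual model):

```typescript
// Estimated weight memory for a hypothetical 1.2M-parameter model.
const parameterCount = 1_200_000;

const bytesForMode = (mode: "float32" | "float16" | "int8"): number =>
  mode === "float32" ? 4 : mode === "float16" ? 2 : 1;

for (const mode of ["float32", "float16", "int8"] as const) {
  const mb = (parameterCount * bytesForMode(mode)) / (1024 * 1024);
  console.log(`${mode}: ${mb.toFixed(2)} MB`);
}
// float32: 4.58 MB
// float16: 2.29 MB
// int8: 1.14 MB
```

The estimate covers weights only; activations, framework overhead, and backend buffers are not included, so real memory use is higher in every mode.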

Mission Debrief

Lower precision reduces model memory requirements.

Speed gains depend on backend and hardware support.

Quantization introduces rounding error that may shift confidence or top labels.