ClickHouse NumericIndexedVector: Design and Implementation of a Sparse Numeric Vector on Bitmap + BSI

In data analytics, bitmaps are one of the most satisfying data structures — a few AND / OR / XOR ops and a set computation is done, fast and cheap. But they have a built-in ceiling: each position can only answer “present or absent.” The moment the question becomes “what real value does this key carry?” — dwell time, order amount, some dimensional count — a plain bitmap is no longer enough. At that point we have to fall back to wide tables, whose compression is far weaker than what a bitmap can deliver.

This post introduces NumericIndexedVector, recently merged into ClickHouse. Its goal is exactly to fill that gap: store a column of (key, value) pairs as a “sparse vector indexed by key” — Bitmap keeps the storage compact, and pointwise add / sub / mul / div and aggregation by key are computed directly on the Bitmap via AND / OR / NOT, without first unpacking it back into row-wise (key, value).

This is the first post of a two-part series and covers why / what / how in one shot — the motivation, the BSI encoding, the bit-level algorithms for add / sub / mul / div / compare, and the engineering landing inside ClickHouse. The next post shifts to how to use it well and the measured payoff.

1. From Bitmap to RoaringBitmap

A bitmap is just a contiguous sequence of bits, where each bit records whether one integer belongs to the set — bit i set means i is a member, unset means it isn’t.

Two bitmaps give you every boolean set operation for free: AND for intersection, OR for union, XOR for symmetric difference, ANDNOT for difference. With POPCOUNT on top, cardinality is O(n/64) (scalar), O(n/256) (AVX2), or O(n/512) (AVX-512).

Three uses dominate OLAP:

Inverted indexes / boolean filters — many predicates AND-ed together are just bitmaps AND-ed together. Each extra predicate costs one more bitwise op.
User profiling / tag systems — keep a bitmap per tag (who has the tag), and audience selection becomes boolean algebra over those bitmaps.
Distinct counts — push user-ids into a bitmap and POPCOUNT at the end for the unique count. No sorting, no hash table.

The practical problem with naive bitmaps is sparsity. A UInt32 key means 2³² (≈ 4.3 × 10⁹) bits — a fixed 512 MB of storage even when the actual set holds only a few thousand members. RoaringBitmap sidesteps this with per-bucket adaptive containers and is one of the most efficient implementations today; for a UInt32 key its layout looks like:

The key is split by its top 16 bits into buckets, each bucket holding at most 2¹⁶ members (the low 16 bits).
Each bucket picks one of three container types based on its cardinality and distribution:
- Array container — cardinality ≤ 4096: sorted array, 2 bytes per element.
- Bitmap container — higher cardinality: an 8 KB plain bitmap, O(1) random access.
- Run container — many contiguous ranges: a list of (start, length) pairs, the densest option.
Every boolean op (AND / OR / ANDNOT / XOR) has a dedicated implementation per container pair and goes through SIMD, fully leveraging vectorization.

You get near-optimal storage across the sparsity spectrum without giving up bitwise ops. RoaringBitmap has an official site and reference implementations in multiple languages; ClickHouse’s AggregateFunction(groupBitmap, UInt32) is built directly on top of it.

If you dig into RoaringBitmap, you’ll find the implementation is genuinely impressive — both in storage and in compression. Even so, the ceiling is clear: each position only encodes “present / absent” — the per-position value cannot be extended into the real-number domain. Yet analytics is rarely just about 0 / 1 signals like “active today” or “clicked the ad”; far more often the metrics that matter are real-valued — dwell time, transaction amount, dimensional counts. So two questions: how do we keep RoaringBitmap’s storage benefits for those metrics? And — going one step further — can the usual add / sub / mul / div run directly on top of RoaringBitmap’s And / Or / Not / Xor?

NumericIndexedVector answers both, with RoaringBitmap as its underlying storage. Understanding RoaringBitmap therefore pays dividends when using NumericIndexedVector well.

2. The Analytics Workload, Abstracted

A very common class of analytics query: two large tables joined on some UInt32 key, followed by filtering / aggregation / pointwise arithmetic. The key could be user-id, item-id, session-id, device-id — it doesn’t really matter.

SELECT A.key, f(A.val_a, B.val_b)
FROM A JOIN B ON A.key = B.key;

The typical cost breakdown:

key cardinality easily hits tens or hundreds of millions; for products like TikTok or WeChat with hundreds of millions of DAU, this kind of workload burns petabyte-scale I/O and millions of CPU core-hours per day.
the hash join has to reshuffle both tables by key;
after the join, every row pays for f and the final row-wise sum — it’s row-wise scanning end to end.

A natural idea: if we compress the val_a / val_b columns we want to compute on into “a vector keyed by id,” the join semantics get absorbed by the convention “same position in both vectors means same key,” and arithmetic becomes a per-position pointwise op. Bitmaps already do this for boolean columns — can we do it for real-valued columns too?

Yes. That’s what NumericIndexedVector is about.

3. NumericIndexedVector Core Idea

NumericIndexedVector is built on the paper Large-Scale Metric Computation in Online Controlled Experiment Platform; the key technique it uses is Bit-Sliced Index (BSI).

Say we want to represent a (key: UInt32, value: UInt64) mapping as a vector. value is UInt64, up to 64 bits, so we prepare 64 bitmap layers — call them $B_0, B_1, \ldots, B_{63}$ . Each layer corresponds to one binary digit of the value:

$B_i[j] = 1$ iff the value for the $j$ -th key has its $i$ -th bit set.

A simple weighted sum reconstructs the value (call it $C[j]$ ):

C[j] \;=\; \sum_{i=0}^{63} B_i[j] \cdot 2^i

A tiny example

Four keys (j = 0, 1, 2, 3) with values 5, 2, 7, 0. In 3-bit form:

key `j`	value	bit 2	bit 1	bit 0
0	5 (101)	1	0	1
1	2 (010)	0	1	0
2	7 (111)	1	1	1
3	0 (000)	0	0	0

The three layers are:

$B_0 = \{0, 2\}$ (keys whose bit 0 is set)
$B_1 = \{1, 2\}$
$B_2 = \{0, 2\}$

Check: C[0] = 1·1 + 0·2 + 1·4 = 5 ✓, C[2] = 1 + 2 + 4 = 7 ✓.

The big picture

Original BSI is defined over non-negative integers; NumericIndexedVector extends it to signed values via two’s complement (next section).
It can be extended further into the real-number domain via fixed-point: dedicate the low $f$ bits to the fractional part and the rest to the integer part, storing $\text{round}(\text{value} \times 2^f)$ . For example, UInt64 with 14 fractional bits gives a smallest unit of $2^{-14} \approx 6.1 \times 10^{-5}$ and a per-value rounding error of at most $2^{-15} \approx 3.05 \times 10^{-5}$ (~5 decimal digits of precision); the integer part still has 50 bits (up to $\approx 1.1 \times 10^{15}$ ), which comfortably covers metrics like dwell time. For amount-style metrics it’s usually cleaner to multiply upstream by a constant and store as an integer — zero rounding error.
The thing worth remembering: every layer $B_i$ is an independent RoaringBitmap. Every layer is compact under sparse key spaces (the next post covers further tricks that push bitmap compactness even further), and every “bitwise boolean op across layers” rides RoaringBitmap’s SIMD implementation.

A NumericIndexedVector is an ordered stack of RoaringBitmap layers plus a tiny header (bit width, signed flag). Let’s see how arithmetic and comparison work on that stack.

4. Bit-Level Algorithms: Add / Sub / Mul / Div / Compare

I’ll use ^ / & / &~ / | for bitmap XOR / AND / ANDNOT / OR throughout. $N$ layers are indexed low-to-high as X[0]..X[N-1] and Y[0]..Y[N-1]. All pseudocode in this section comes from the paper Large-Scale Metric Computation in Online Controlled Experiment Platform cited above.

4.1 Addition: Full Adder Lifted to the Bitmap Layer

A single-bit full adder takes two input bits $X_i$ , $Y_i$ and an incoming carry Cin, producing a sum bit $S_i$ and an outgoing carry Cout:

Sᵢ   = Xᵢ ^ Yᵢ ^ Cin
Cout = (Xᵢ & Yᵢ) | (Cin & (Xᵢ ^ Yᵢ))

Replace each “bit at position $i$ ” with “entire bitmap at layer $i$ ,” and ^ / & / | become bitwise ops on bitmaps. BSI addition drops out directly:

input:  X[0..N-1], Y[0..N-1]
output: S[0..N-1]

bitmap Cin = empty;
for i in 0..N-1:
    S[i] = X[i] ^ Y[i] ^ Cin;
    Cin  = (X[i] & Y[i]) | (Cin & (X[i] ^ Y[i]));

$N$ loop iterations, with a constant number (about 5) of bitmap ops per iteration; total time is roughly $O(N \cdot |B|_{\text{op}})$ , where $|B|_{\text{op}}$ is the cost of one bitmap bitwise op (the per-iteration constant is absorbed by big-O). The result stays at the same bit width — if the highest Cout is non-empty, it’s an overflow, and you either truncate or allocate more output layers.

4.2 Subtraction: Borrow Instead of Carry

A full subtractor replaces carry with borrow Bin:

Dᵢ   = Xᵢ ^ Yᵢ ^ Bin
Bout = ((Yᵢ | Bin) &~ Xᵢ) | (Bin & Yᵢ)

One-to-one translation to bitmaps:

input:  X[0..N-1], Y[0..N-1]
output: D[0..N-1]

bitmap Bin = empty;
for i in 0..N-1:
    D[i] = X[i] ^ Y[i] ^ Bin;
    Bin  = ((Y[i] | Bin) &~ X[i]) | (Bin & Y[i]);

Worth noting: the borrow Bin left at loop exit is exactly the bitmap of X < Y — which is why the comparison operators can reuse the subtraction path (we’ll use it directly in the next subsection).

4.3 Multiplication: Long Multiplication + Reusing Addition

Multiplication has no single-step closed form, so we do long multiplication — for each bit i of Y, if Y[i] is non-empty, treat X & Y[i] as the summand, left-shift by i, and add it to the accumulator M:

input:  X[0..N-1], Y[0..N-1]
output: M[0..2N-1]              // result is up to 2N bits wide

bitmap M[2N] = {empty};
for i in 0..N-1:
    if Y[i] is empty: continue;
    bitmap S[N];
    for j in 0..N-1:
        S[j] = X[j] & Y[i];
    bitmap Cin = empty, tmp;
    for k in i..i+N-1:
        tmp  = M[k] ^ S[k-i] ^ Cin;
        Cin  = (M[k] & S[k-i]) | (Cin & (M[k] ^ S[k-i]));
        M[k] = tmp;

The outer loop walks bits of Y; the inner loop is a single addition recurrence from 4.1. The result is up to 2N bits wide; production implementations truncate to the declared width — the next section covers that trade-off.

⚙️ What the implementation actually does: the long-multiplication recurrence above is the most direct mathematical formulation, but “simulating hardware multiply in the BSI domain” carries a large constant factor (each result bit pays for a full addition recurrence, totaling $O(N^2 \cdot |B|_{\text{op}})$ ). NumericIndexedVector’s implementation takes a different path — decode both vectors into row-wise (key, value) arrays, multiply with Float64 element-wise, and re-encode back into BSI. The round-trip gives up “everything stays in the compressed domain,” but the constant is much smaller. See pointwiseRawBinaryOperate in AggregateFunctionGroupNumericIndexedVectorDataBSI.h.

4.4 Comparison: Borrow Bit + Equality

With subtraction in hand, < / > are nearly free — keep only the borrow output L from 4.2:

// L = 1 means X[j] < Y[j]
bitmap L = empty;
for i in 0..N-1:
    L = ((Y[i] | L) &~ X[i]) | (L & Y[i]);

Swap X and Y for >.

Equality takes a different path — differ on any bit means unequal:

// E = 1 means X[j] == Y[j]
// U is the non-zero key set of the sparser side of X or Y
bitmap E = U;
for i in 0..N-1:
    E = E &~ (X[i] ^ Y[i]);

!= is the negation of E; <= / >= are < / > OR-ed with ==:

bitmap L = empty;
bitmap E = U;
for i in 0..N-1:
    L = ((Y[i] | L) &~ X[i]) | (L & Y[i]);
    E = E &~ (X[i] ^ Y[i]);
LE = L | E;

All comparison operators produce a vector with values ∈ {0, 1} — semantically, “the set of keys satisfying the predicate.” That vector can be fed straight into numericIndexedVectorPointwiseMultiply as a filter mask — the next post uses this pattern extensively.

4.5 Division: A Shift–Compare–Masked-Subtract Loop

Division has no neat bit-level closed form like add / sub / mul, but classic restoring binary long division in hardware does the job (see sistenix’s illustrated walkthrough). Starting from the most significant quotient bit, every iteration is a shift → compare → masked subtract:

$Y << i$ left-shifts the divisor by $i$ ;
$X > (Y << i)$ produces the $i$ -th quotient bit $a_i$ (one bit per key);
on keys where $a_i = 1$ , subtract $Y << i$ from $X$ , then move on to the next bit.

Because NumericIndexedVector supports fixed-point reals, the implementation first left-shifts the dividend $X$ by decimal_bm_num layers to preserve fractional precision; both dividend and divisor are aligned to the same width and fractional bits before the loop:

input:  X[0..N-1], Y[0..N-1]      // dividend X, divisor Y (aligned to N bits)
output: Q[0..N-1]                  // quotient

bitmap Q[N] = {empty};

// Pre-shift Y all the way left; each loop iteration shifts back right by 1
// so that Y at iteration i is (original Y) << i
Y = shift_left(Y, N);

for i in N-1 .. 0:
    Y = shift_right(Y, 1);             // now Y == (original Y) << i

    // Compare: a_i = X > Y — a {0, 1}-valued bitmap, the i-th quotient bit
    Q[i] = isGreaterThan(X, Y);

    // Masked subtract: only on keys where Q[i] = 1
    masked_Y[0..N-1] = { Y[j] & Q[i] for j in 0..N-1 };
    X = subtract(X, masked_Y);

$N$ outer iterations, each one a single compare + masked subtract — the compare rides 4.4’s $O(N)$ recurrence and the subtract rides 4.2’s $O(N)$ , giving an overall $O(N^2 \cdot |B|_{\text{op}})$ . That’s the fundamental reason division is more expensive than add / sub / mul: every quotient bit pays for a full compare and a full subtract.

One implementation detail: keys present in the dividend but absent from the divisor are filtered out beforehand (an AND against the union of all divisor layers), so “divide by empty” doesn’t leak ambiguous results.

⚙️ What the implementation actually does: like multiplication, BSI-domain long division has a large constant factor (each quotient bit pays for a full compare + masked subtract, totaling $O(N^2 \cdot |B|_{\text{op}})$ ). NumericIndexedVector follows the same “decode → compute → encode” path — decode both sides to row-wise values, divide element-wise as Float64, re-encode back into BSI, sharing the pointwiseRawBinaryOperate codepath with multiplication. The hardware long-division recurrence above is still the cleanest mental model for “how it could be done in the BSI domain”; the engineering trade just preferred simplicity and a smaller constant.

5. Bit Width and Signed Extension

Width Is Fixed at Construction Time

The original BSI naturally supports variable length — use as many layers as the value needs. NumericIndexedVector instead fixes both the width N and the signed flag at construction time, by analogy with C++ integer types:

Simpler inner loops: every loop is for i in 0..N-1 with no runtime layer discovery;
Unused high-bit layers are all-empty RoaringBitmaps — zero storage, zero compute;
Two vectors that need pointwise ops must share the same width, or they need an explicit conversion first.

This also answers a frequently asked question — does picking a wider type waste storage? No. In a 32-layer UInt32 vector, if most values fall in [0, 255], every layer above bit 8 is empty and serializes to a tiny header. Storage cost tracks the bits actually used, not the declared width.

Signed Integers: Two’s Complement

BSI is defined over non-negative integers; negative values follow the hardware playbook — two’s complement:

The top layer acts as the sign bit;
The add / sub recurrences are unchanged, and any overflow Cout is discarded (wrap-around);
Multiplication handles the sign separately (or runs unsigned and flips the sign when operands differ);
Comparison has one extra check at the sign bit — if the two operands differ in sign, the negative side is always smaller.

Boundary policies (truncation vs exception on overflow, division by zero) are branches inside the implementation; the code lives in PR-74193.

Real-Number Support: Fixed-Point Defaults in the Implementation

Section 3 already mentioned that BSI extends to the real-number domain via fixed-point. For Float32 / Float64 inputs, NumericIndexedVector hardcodes a default config (constants in BSINumericIndexedVector):

Constant	Default	Meaning
`DEFAULT_INTEGER_BIT_NUM`	40	Integer-part width: signed range $\pm 2^{39} \approx \pm 5.5 \times 10^{11}$ — covers most analytics magnitudes (amount, count, duration).
`DEFAULT_FRACTION_BIT_NUM`	24	Fractional-part width: smallest unit $2^{-24} \approx 5.96 \times 10^{-8}$ , ~7 decimal digits of precision.
`MAX_TOTAL_BIT_NUM`	64	Hard cap: `integer_bit_num + fraction_bit_num ≤ 64`.
`MAX_FRACTION_BIT_NUM`	24	Fraction-width upper bound: finer precision rarely matters in analytics, only inflates storage / compute.

Why not IEEE 754 floats? Because the variable-length exponent breaks BSI’s “every layer is independent” premise — each key’s mantissa would align differently, so per-bit slicing loses meaning. Fixed-point forces every value onto the same scale, so each layer can do bitwise ops independently.

Value-Is-Zero vs Key-Doesn’t-Exist: the `zero_indexes` Patch

BSI on its own has only one “empty” — every layer is 0 at a given key. But analytics needs to distinguish “value explicitly equals 0” from “key doesn’t exist at all” (e.g. “this user’s recharge is 0” vs “this user is not in the table”). NumericIndexedVector keeps a separate zero_indexes bitmap for keys that exist with value 0:

data_array: 64 BSI bitmap layers, holding all (key, value) with value ≠ 0;
zero_indexes: a single RoaringBitmap holding all keys with value = 0.

So cardinality is data_array_keys ∪ zero_indexes, and a per-key value lookup must also check zero_indexes when all BSI layers are empty. This piece isn’t in the BSI paper, but it’s required in practice — otherwise numericIndexedVectorCardinality and numericIndexedVectorGetValue would conflate 0 with absence.

6. Landing in ClickHouse: Type + Aggregate Functions + Regular UDFs

ClickHouse splits NumericIndexedVector into three layers:

┌───────────────────────────────────────────┐
│  Data type                                │
│  AggregateFunction(                       │
│    groupNumericIndexedVector, ...)        │
└───────────────────────────────────────────┘
           ▲                    ▲
           │                    │
┌──────────┴────────┐  ┌────────┴───────────┐
│  Aggregate funcs  │  │  Regular UDFs      │
│  group... family  │  │  pointwise / cmp   │
│  (State/Merge/    │  │  / extract / agg   │
│   Finalize)       │  │                    │
└───────────────────┘  └────────────────────┘

Data Type

Reuses ClickHouse’s existing AggregateFunction(...) skeleton. Construction takes two arguments — key first, value second:

key type: currently only UInt8/16/32 and the matching Int8/16/32 — because the underlying RoaringBitmap indexes by 32-bit integers; 64-bit-key support takes the bucketing route, covered in the next post.
underlying numeric (value) type: fixes width, signedness, and whether fixed-point; supports UInt8/16/32/64, the matching Int family, and Float32 / Float64.

The expression looks like:

AggregateFunction(groupNumericIndexedVector, UInt32, UInt64)
-- key=UInt32, value=UInt64

Float32 / Float64 use the 40 + 24 fixed-point encoding from the previous section.

Aggregate Functions: the `group...` Family

groupNumericIndexedVector(key, value) — under GROUP BY, aggregates per key into a vector. Semantically equivalent to “sum per key, then pack into a vector.”
groupNumericIndexedVectorState(...) — returns an intermediate state usable in AggregateFunction columns, letting materialized views / pre-aggregated tables persist the vector (the next post expands on this).
numericIndexedVectorBuild(map) — bypasses GROUP BY and constructs directly from a Map.

Regular UDFs

Every UDF operates on already-constructed vectors. By purpose, four groups:

Pointwise arithmetic (both sides vectors, or vector + scalar):

numericIndexedVectorPointwiseAdd / Subtract / Multiply / Divide

Pointwise comparison (output is a {0, 1}-valued vector):

numericIndexedVectorPointwiseEqual / NotEqual
numericIndexedVectorPointwiseLess / LessEqual / Greater / GreaterEqual

Extraction:

numericIndexedVectorGetValue(v, key) — fetch a single value by key;
numericIndexedVectorToMap(v) — unwrap to Map(key, value).

Aggregation (returns a scalar):

numericIndexedVectorAllValueSum(v) — sum over all values in the vector;
numericIndexedVectorCardinality(v) — count of non-zero keys.

The follow-up numericIndexedVectorPointwiseMin / Max and their global variants (PR-88840) could be composed from compare × multiply × add, but a dedicated implementation avoids one intermediate vector.

Wrapping Up + What’s Next

The BSI encoding, the bit-level pseudocode, the fixed-point defaults, and the zero_indexes patch — they’re all answers to the same question: how do you make a bitmap that originally only encodes “in or out of the set” carry an actual value, without paying the storage cost of widening into a row-wise table? The answer is to slice the value’s bits into a stack of RoaringBitmaps and let every arithmetic operation collapse into RoaringBitmap-level AND / OR / NOT. That path is BSI’s core idea, and it also pins down the boundaries. A few things worth keeping in mind that aren’t called out directly in the body:

Sparsity is a prerequisite for any payoff. If your key space is 95% filled, every BSI layer degenerates into a dense bitmap container — no storage win, no SIMD-acceleration win. The sweet spot is “key space in the billions, actual cardinality in the hundreds of thousands to tens of millions.”
Be conservative with bit width — especially the fractional bits. Extra integer bits aren’t really that costly: as section 5 noted, unused high-bit layers are empty, so they take essentially no storage and bitwise ops on them are near-free (the multiplication pseudocode even has if Y[i] is empty: continue;). Fractional bits behave differently: fixed-point quantizes every value onto a $2^{-f}$ grid, so most values are non-empty in the fractional layers, and every extra fractional bit pays full storage and compute cost. The default 24 fractional bits for Float64 already balances “precise enough for analytics” against “not wasteful”; going wider rarely helps.
Not every operator stays in the compressed domain. Section 4 can leave the impression that all arithmetic happens inside BSI — only add / sub / compare actually do. Once mul / div enter the picture it’s a “decode → row-wise Float64 compute → re-encode” round trip; one line of numericIndexedVectorPointwiseMultiply in SQL is a full decode → compute → encode loop underneath.
zero_indexes exists for analytics semantics, not algorithmic reasons. To the algorithm itself, value=0 and missing key are indistinguishable; to analytics they aren’t. The same split shows up again in materialized views, the serialization format, and SUM semantics — the next post runs into it more than once.

The next post flips the lens to how to use it well — which workloads to migrate, three SQL rewrite patterns (pointwise, join elimination, conditional sum), how bucketing keeps RoaringBitmap cheapest and lifts 32-bit position encoding into 64-bit key support, engineering practice around materialized views and width selection, and a benchmark drawn from the paper (storage, single-shot compute, join replacement, ad-hoc latency).