The previous post covered the internals of NumericIndexedVector — BSI encoding, the bit-level algorithms for add / sub / mul / div / compare, and the three-layer “type + aggregate function + regular UDF” structure on the ClickHouse side. This post covers its best practices: bucketing design and Position Encoding mechanics, how to determine which workloads to migrate, how to rewrite the SQL and design the write path, plus the measured payoff from the paper.
All benchmark numbers below come from the paper by Xiong & Wang, measured in WeChat scenarios.
1. Bucketing and Position Encoding
Whether NumericIndexedVector works well comes down to two things — bucketing and Position Encoding. Bucketing lets vectors shard across a distributed cluster and supports the variance estimation experiment platforms need. Position Encoding does another round of “key → compact position” re-encoding within each bucket, so every BSI bitmap lands close to optimal storage. This section covers both for the 32-bit key case (§1.1 + §1.2), then extends to 64-bit keys in §1.3 — which is essentially the same hash + bucket + within-bucket encoding scheme.
1.1 The two core purposes of bucketing + the MurmurHash32 scheme
Bucketing in NumericIndexedVector serves two purposes:
- Data sharding: different buckets can be distributed across different ClickHouse shards, so both construction and queries leverage distributed cluster parallelism.
- Variance estimation: in experiment / A-B testing platforms, metrics need variance estimates. If buckets are randomly and statistically independent, variance can be estimated directly on bucket aggregates without rescanning the raw data (see Covariance Estimation and its Application in Large-Scale Online Controlled Experiments).
The common scheme that satisfies both purposes is MurmurHash32 + high-bit contiguous-segment bucketing (for UInt32 keys):
- MurmurHash32 is a bijection within the
UInt32space — business keys hash to a uniformly distributed value, and the original key is recoverable from the hash. This property simultaneously delivers: even sharding (purpose 1) and per-bucket randomness / independence (purpose 2). - Take the top 10 bits of the hash as the bucket id, giving 1024 buckets. The bucket id is used purely for sharding routing (which shard / which vector the data lands in).
- Inside each vector, the full 32-bit hash value is stored (without truncating the high 10 bits). The in-bucket hash values are concentrated in a (≈ 4M) contiguous hash segment — specifically
[bucket_id × 2^{22}, (bucket_id+1) × 2^{22}). The number of in-bucket hash values is roughly 1/1024 of the total cardinality (tens to hundreds of thousands), still sparse across those 4M positions; §1.2’s Position Encoding does the per-bucket 0-based re-encoding to push each BSI bitmap close to optimal.
-- MurmurHash32 + top 10 bits as bucket id (UInt32 key)
SELECT
dt,
bitShiftRight(murmurHash3_32(key), 22) AS bucket_id,
groupNumericIndexedVectorState(murmurHash3_32(key), val) AS v_state -- in-vector key is the full hash
FROM source_table
GROUP BY dt, bucket_id;
Three anti-patterns, each failing at one of the three design layers (“bucketing / hashing / high-bit contiguous segments”):
- Skip bucketing, dump all business keys into a single vector: all data lands in one vector, which can’t be distributed across shards — the cluster’s parallelism goes unused; a single vector also gets too large, slowing both construction and queries.
- Bucket without hashing: bucketing directly on the high bits of the raw business key. Business-key high bits are typically uneven (user-ids cluster in certain ranges, order-ids increase by time, etc.), so buckets become unbalanced — sharding is uneven and variance estimation breaks down.
- Hash, but bucket by
modor other random methods (instead of high-bit contiguous segments): keys within a bucket no longer cluster in a continuous hash range — they scatter across the entire 32-bit space. The underlying RoaringBitmap’s top-level containers each receive very few keys, almost all stored as Array containers; per-bucket storage cost rises and queries get slower.
1.2 Position Encoding: further within-bucket compactness
§1.1 shrunk each bucket’s hash range to a contiguous ≈ 4M segment. But the hash values actually present in a bucket (tens to hundreds of thousands) are still sparsely distributed across those 4M positions — they don’t automatically land on a contiguous 0..N range.
Position Encoding does another round of re-encoding within each bucket: it maps the keys actually present in the bucket to consecutive positions 0, 1, 2, ... in order of appearance, and every BSI bitmap is built on this contiguous position space (paper §3.4).
Two engineering benefits:
-
In-bucket BSI bitmaps become extremely compact: each bitmap’s cardinality equals “the number of non-zero keys”, and they are 0-based contiguous integers, landing on RoaringBitmap’s Run / Array containers — storage cost is close to optimal. RoaringBitmap’s three container tiers:
In-bucket position distribution Container Per-bucket cost Many contiguous ranges (the post-encoding norm) Run: (start, length)pairsSmallest, tens of bytes per range Sparse (≤ 4096) Array: sorted array (2 bytes / element) KB Dense random (> 4096) Bitmap: 8 KB raw bitmap Fixed 8 KB Position Encoding concentrates per-bucket positions onto a few top-level containers (mostly the cheapest Run container), bringing per-bucket cost to tens of bytes to a few KB. Without Position Encoding, in-bucket keys scatter across the business key space, requiring many more top-level containers; total storage cost rises with key sparsity.
-
Sharing the encoding eliminates joins: when two vectors share the same Position Encoding, “same position means same key”. Subsequent pointwise operations therefore implicitly perform “match by key” without any explicit join or shuffle.
Position Encoding consistency is also an implicit prerequisite during the usage stage: the two vectors participating in a pointwise operation must come from the same encoding, otherwise “same position” maps to different keys. Cross-partition / cross-version merges must go through Merge to unify the encoding first.
1.3 Lift index to 64 bits
Business keys are often 64 bits — global snowflake ids, cross-tenant global user-ids, hash fingerprints. Supporting 64-bit keys uses the same hash + bucket + within-bucket Position Encoding scheme as §1.1, just with a 64-bit hash function and the more direct mod-based bucketing:
-
Hash + bucket: use
murmurHash3_64to scatter business keys into 64-bit hash values, then take the hash modulo 1024 asbucket_id, giving 1024 buckets (same magnitude as §1.1, sufficient for sharding while keeping per-bucket cardinality manageable); -
Within-bucket Position Encoding: each bucket maintains an encoding dictionary + current max position counter, mapping each key seen to a 0-based contiguous position — already-seen keys return the same position, new keys get
max + 1. This is the Position Encoding logic itself:# Each bucket_id keeps its own encoding dictionary + current max position encoding: Dict[bucket_id, Dict[key, UInt32]] = {} next_pos: Dict[bucket_id, UInt32] = {} def encode(bucket_id, key) -> UInt32: if key in encoding[bucket_id]: return encoding[bucket_id][key] # already encoded, return existing position pos = next_pos[bucket_id] encoding[bucket_id][key] = pos # new key, assign the next contiguous position next_pos[bucket_id] = pos + 1 return pos
Current limitation: NumericIndexedVector in ClickHouse does not natively support 64-bit keys today (the key parameter type tops out at UInt32, see Post 1 §6). The “hash-based bucketing + per-bucket Position Encoding” pattern in this section is the common engineering composition that uses the existing 32-bit interface to handle 64-bit business keys.
Capacity: each bucket can hold up to unique keys (Position Encoding uses UInt32 for the position); total system capacity is roughly , sufficient for most current businesses.
-- Conceptual SQL: encode is the Position Encoding logic above (executed inside NumericIndexedVector)
SELECT
dt,
murmurHash3_64(key64) % 1024 AS bucket_id,
groupNumericIndexedVectorState(
encode(bucket_id, key64), -- per-bucket dictionary lookup + auto-increment
val) AS v_state
FROM raw_64bit_table
GROUP BY dt, bucket_id;
Data model:
CREATE TABLE vec_64 (
dt Date,
bucket_id UInt32, -- murmurHash3_64(key64) % 1024
v AggregateFunction(groupNumericIndexedVector, UInt32, UInt64),
PRIMARY KEY (dt, bucket_id)
);
Three typical query operations:
Pointwise operations: inner-join the two tables on bucket_id, then perform pointwise operations within each matched bucket; when the buckets do not match on both sides, behavior follows the operator’s semantics — addition fills with 0, multiplication drops the bucket, subtraction keeps the left side.
SELECT numericIndexedVectorAllValueSum(
numericIndexedVectorPointwiseMultiply(
groupNumericIndexedVectorMerge(a.v_state),
groupNumericIndexedVectorMerge(b.v_state)))
FROM vec_a a INNER JOIN vec_b b
ON a.bucket_id = b.bucket_id;
Global aggregation: AllValueSum / Cardinality reduce per bucket, then perform a scalar reduction across buckets (sum-of-sums, sum-of-cardinalities).
Reconstruct as a Map: ToMap produces Map(UInt32, value) per bucket — the keys are the position encoding values passed at construction (i.e., the result of encode(bucket_id, key64)). To recover the original 64-bit key, the application must maintain a per-bucket reverse mapping (bucket_id, position) → key64; the reverse lookup must include bucket_id, since the same position in different buckets corresponds to different key64 values.
The specific bucket count can be tuned to the business scale (1024 here as an example).
When bucketing is designed correctly, the §4 benchmark numbers become reproducible — both storage and query payoff depend on the hash + bucket + Position Encoding combination being in place.
2. Applicable Scenarios
The engineering cost of migrating to NumericIndexedVector includes: an additional construction path, materialized-view refactoring, and downstream SQL rewrites. The next step is therefore deciding whether the workload at hand is worth migrating.
2.1 Scenarios suited for migration
Migration delivers significant payoff when all three conditions hold:
UInt-keyed join, orGROUP BY key, with high key cardinality (tens of millions to a billion);- The participating columns are integers or floats (or quantizable to integers); integers fit within
UInt64or the matching signed type, and floats use the default 40 + 24 fixed-point encoding (precision caveat in §2.2); - The downstream operations are pointwise arithmetic / conditional filters / sum / count — operations that vectorize naturally.
2.2 Scenarios not suited for migration
- Values are
Decimalor strings. BSI is only defined over integers (including fixed-point reals) and cannot directly handle these value types.Decimal’s core promise is exact decimal arithmetic (e.g.0.1 + 0.2 = 0.3exactly), but BSI’s fixed-point uses base-2 (steps of ) and cannot represent all base-10 fractions exactly — forcingDecimalthrough it degrades it into a float and discards its very reason for existing. The recommended approach is to multiply by a constant upstream and store as integer (e.g. money in cents). - The workload requires
GROUP BYon a non-key column. The vector’s dimension is the key, so changing the group dimension is equivalent to rebuilding the vector. - The workload requires window functions, rank, or order-dependent semantics. Position Encoding preserves neither time nor any other ordering.
- Key cardinality is low (under a million) with simple query patterns. Row-wise typically delivers lower absolute latency, and the vector construction overhead becomes a net loss.
Note on float support: floats are supported — they take the fixed-point path described in the previous post (24 fractional bits by default, ~7 decimal digits of precision). If the workload has extreme precision requirements (beyond what 24 fractional bits can represent), multiplying by a constant upstream and storing as integer is recommended.
2.3 One-line summary
If the SQL has the form
SELECT key, f(cols...) FROM ... GROUP BY keyorA JOIN B ON A.key = B.keyfollowed by a key-aware aggregation, with key cardinality andcolsare real-valued (integer or float), the workload qualifies for migration.
3. Usage Guide
Once a workload qualifies, the usage guide has two parts: three common SQL rewrite patterns (§3.1–§3.3) and two construction approaches for the write path (§3.4–§3.5).
3.1 SQL rewrite: single-table pointwise arithmetic
Scenario: arithmetic between columns of a single table under the same key.
-- row-wise
SELECT key, val_a * val_b + val_c AS combined
FROM T;
Vector rewrite:
WITH
va AS (SELECT groupNumericIndexedVector(key, val_a) FROM T),
vb AS (SELECT groupNumericIndexedVector(key, val_b) FROM T),
vc AS (SELECT groupNumericIndexedVector(key, val_c) FROM T)
SELECT numericIndexedVectorToMap(
numericIndexedVectorPointwiseAdd(
numericIndexedVectorPointwiseMultiply(va, vb),
vc));
Sources of payoff:
- Row-wise scan is replaced by bitwise RoaringBitmap operations. The high-bit bitmaps are typically empty, so the CPU’s actual processed bit count is far smaller than the row-wise byte total.
- Use
numericIndexedVectorToMapfor per-key results; usenumericIndexedVectorAllValueSumfor global aggregation. - Single-table speedups typically land at a few times to roughly 10×, an order of magnitude smaller than the join case below.
3.2 SQL rewrite: two-table key-join + aggregation
Scenario: A JOIN B ON A.key = B.key followed by pointwise arithmetic + global or per-key aggregation. This pattern delivers the most significant payoff.
-- row-wise: hash join + row-wise sum
SELECT sum(A.val_a * B.val_b)
FROM A JOIN B ON A.key = B.key;
Vector rewrite:
WITH
va AS (SELECT groupNumericIndexedVector(key, val_a) FROM A),
vb AS (SELECT groupNumericIndexedVector(key, val_b) FROM B)
SELECT numericIndexedVectorAllValueSum(
numericIndexedVectorPointwiseMultiply(va, vb));
va and vb share the same Position Encoding, so “same position means same key” — the hash-join semantics are absorbed by the encoding; no shuffle, no row-wise scan. Replace the outer AllValueSum with ToMap for per-key results.
The payoff is not linear — the larger the key cardinality and the more skewed the join, the more significant the gain. The paper reports the same workload’s pre-aggregation CPU consumption dropping to ~1/4, and ad-hoc mean latency falling from 22.3 s to 6.0 s (full table in §4).
3.3 SQL rewrite: conditional sum
Scenario: sum(if(cond, val, 0)) / sum(val) WHERE cond — the most common conditional aggregation.
-- row-wise
SELECT sum(if(val_a < 100, val_b, 0))
FROM T;
Vector rewrite:
WITH
va AS (SELECT groupNumericIndexedVector(key, val_a) FROM T),
vb AS (SELECT groupNumericIndexedVector(key, val_b) FROM T)
SELECT numericIndexedVectorAllValueSum(
numericIndexedVectorPointwiseMultiply(
numericIndexedVectorPointwiseLess(va, 100),
vb));
Key points:
numericIndexedVectorPointwiseLess(va, 100)outputs a vector with values — the set of keys satisfying the predicate;PointwiseMultiplyuses that mask as a filter, multiplying it withvb;AllValueSumproduces the final result.
Compare → multiply-mask → aggregate is the most common three-step composition in vector SQL. The join rewrite in §3.2 is often nested with an additional conditional mask, following the same structure.
3.4 Construction timing: ad-hoc vs pre-aggregation
Once the SQL form is determined, the remaining question is when these vectors are constructed. Ad-hoc construction (running groupNumericIndexedVector at query time) and pre-aggregation (using a materialized view to persist vectors into an AggregateFunction column at write time) are two distinct engineering approaches.
Ad-hoc construction — read the table directly at query time with SELECT groupNumericIndexedVector(key, val) FROM T:
- Suited for: manageable key cardinality, bounded scanned partition count, flexible query dimensions (the same table is queried with different group dimensions).
- Cost: every query must scan the column and construct the vector; the construction path itself is non-trivial.
Pre-aggregation — create a target table with an AggregateFunction(groupNumericIndexedVector, ...) column, and use a materialized view to persist the vector at write time:
CREATE MATERIALIZED VIEW mv_vectors TO target_table AS
SELECT
dt,
groupNumericIndexedVectorState(key, val_a) AS va_state,
groupNumericIndexedVectorState(key, val_b) AS vb_state
FROM source_table
GROUP BY dt;
At query time, use groupNumericIndexedVectorMerge to combine states into a vector and pass to pointwise / aggregation operators:
SELECT numericIndexedVectorAllValueSum(
numericIndexedVectorPointwiseMultiply(
groupNumericIndexedVectorMerge(va_state),
groupNumericIndexedVectorMerge(vb_state)))
FROM target_table
WHERE dt BETWEEN '2026-01-01' AND '2026-01-29';
- Suited for: high-frequency workloads where the same vector table is queried repeatedly — A/B-experiment metric platforms, user-profile lookups, order / retention dashboards.
- Cost: the write path becomes longer, and storage gains an extra
AggregateFunctioncolumn. The headline numbers in §4 come primarily from this path.
3.5 Notes
| Dimension | Conclusion | Reason |
|---|---|---|
| Bit-width selection | Round up conservatively to the value range (use UInt32 when UInt24 suffices) | Unused high-bit layers correspond to empty bitmaps with near-zero storage and compute cost; overflow forces truncation or an exception |
| Signed vs unsigned | Force signed when negative values or differences may occur | Two’s-complement extension is fixed at construction time; converting later requires rebuilding all bitmaps |
| Position-Encoding consistency | Two vectors used in pointwise ops must come from the same encoding | ”Same position” maps to different keys under different encodings; cross-partition / cross-version merges must go through Merge first (§1.2) |
| Operations not applicable | Do not apply window functions / sort / rank to vectors | Position Encoding preserves neither time nor any ordering |
4. Measured Payoff and Comparisons
All numbers below are from the Xiong & Wang paper’s measurements in WeChat scenarios. Folding the four key tables into one:
| Dimension | Row-wise | NumericIndexedVector | Gain |
|---|---|---|---|
| Total storage | 4.1 TB | 1.6 TB | ~40% (60% reduction) |
Single global sum time | 59.2 s | 0.6 s | ~100× |
| Pre-aggregation CPU hours | x | x / 4 | ~4× |
| Ad-hoc query mean latency | 22.3 s | 6.0 s | ~3.7× |
Each of the four corresponds to a different source of savings, explained below.
4.1 Storage reduced by 60%: BSI is a compression format
A row-wise (key, value) column stores a full tuple for each pair (e.g. ~12 bytes for UInt32 + UInt64), with storage cost growing linearly with the number of unique keys. The vector form splits this into two layers: Position Encoding maps keys to compact 0..N-1 positions; BSI then slices the value across multiple bitmaps, each backed by RoaringBitmap for compact storage — post-encoding contiguous positions land on Run containers (nearly free), and bitmaps for the high bits of the value are typically all-empty (zero cost). Sparse columns are naturally compact under BSI — the uncompressed vector size is comparable to the row-wise column compressed with LZ4 / ZSTD. BSI completes compression at the encoding stage, so applying generic compression on top yields limited additional gain.
4.2 Single sum sped up ~100×: SIMD + sparse bit-width
numericIndexedVectorAllValueSum operates directly on the compressed representation, performing POPCOUNT × bit-weight per bit. The high-bit bitmaps are typically empty, so the CPU’s actual processed data drops to single-digit percent of row-wise; combined with SIMD, this yields the 100× speedup. The optimization ceiling for row-wise is “no decompress, no deserialize,” but the column’s total bytes are a fixed cost that cannot be eliminated.
4.3 Pre-aggregation CPU reduced to ~1/4: the join is absorbed by the encoding
This payoff comes from the join rewrite in §3.2. Hash join + shuffle cost grows linearly with key cardinality; switching to PointwiseMultiply + AllValueSum means key is no longer a matched dimension — it has been encoded into the position. Both CPU and network bandwidth drop, and the shuffle phase is eliminated entirely.
4.4 Ad-hoc mean latency reduced ~3.7×: three savings combined
Ad-hoc query latency reduction comes from three combined savings: smaller storage → less I/O; vectorized compute → less CPU; join elimination → no shuffle jitter. The 22.3 s → 6.0 s mean obscures the tail-latency story — for join-heavy queries, the vector approach’s p99 improvement exceeds its mean improvement.
4.5 Side-by-side comparisons and limits
- vs
groupBitmap:groupBitmaponly answers “is this key present?”, carrying no value;NumericIndexedVectoris its strict superset. But in valueless profiling / audience-selection scenarios,groupBitmapremains the right choice. - vs row-wise
(key, value)+ hash join: row-wise is more flexible, but its cost grows linearly with key cardinality. For low-cardinality / simple queries, row-wise typically delivers lower absolute latency. - vs dense array / fixed-length numeric vector: small key spaces (under a million contiguous ids) are faster as dense arrays; sparse / large key spaces (tens of millions, non-contiguous) require BSI to fit storage.
- Scenarios not applicable: high-cardinality non-key column groupings,
Decimal/ string value domains, window / temporal semantics. Consistent with §2.2.
Wrapping Up
The optimal use case for NumericIndexedVector is “high-cardinality UInt key + real-valued columns (integer or float) + pointwise / sum / count downstream”. In that scenario, the paper’s measured payoff is storage -60%, single-shot sum 100×, pre-aggregation CPU 4×, ad-hoc latency 3.7×. This post is organized as “bucketing → applicable scenarios → usage guide” precisely because the first two engineering designs (bucketing and Position Encoding) determine whether the above payoff can be realized in full.
Reading the previous post’s “algorithm layer” alongside this post’s “engineering layer”, several design trade-offs run through the entire series and tend to be overlooked:
- Compression happens at the encoding stage: BSI’s sparsity across multiple bitmaps itself carries the compression role, distinct from row-wise columns relying on LZ4 / ZSTD. This explains why generic compression on top of BSI yields limited additional gain.
- The essence of join elimination is the key entering position: Position Encoding encodes “match by key” into the position, so pointwise operations are equivalent to implicit joins. Whether this payoff is achievable depends on both vectors sharing the same Position Encoding — this is why “Position-Encoding consistency” is listed in the §3.5 notes table.
- Bucketing + Position Encoding is the engineering boundary: per-bucket vector storage cost is highly sensitive to the post-Position-Encoding position distribution; the gap between best and worst tiers can be an order of magnitude. Business keys must go through hash + bucketing before construction; never dump them into a single vector.
- 64-bit key support isn’t a separate extension — it’s the same scheme with a different hash function and bucketing form: replace
murmurHash3_32 + high-bit-segment bucketing(§1.1) withmurmurHash3_64 + mod bucketing(§1.3); within-bucket Position Encoding is unchanged.