The previous post covered the internals of NumericIndexedVector — BSI encoding, the bit-level algorithms for add / sub / mul / div / compare, and the three-layer “type + aggregate function + regular UDF” structure on the ClickHouse side. This post covers its best practices: bucketing design and Position Encoding mechanics, how to determine which workloads to migrate, how to rewrite the SQL and design the write path, plus the measured payoff from the paper.

All benchmark numbers below come from the paper by Xiong & Wang, measured in WeChat scenarios.

1. Bucketing and Position Encoding

Whether NumericIndexedVector works well comes down to two things — bucketing and Position Encoding. Bucketing lets vectors shard across a distributed cluster and supports the variance estimation experiment platforms need. Position Encoding does another round of “key → compact position” re-encoding within each bucket, so every BSI bitmap lands close to optimal storage. This section covers both for the 32-bit key case (§1.1 + §1.2), then extends to 64-bit keys in §1.3 — which is essentially the same hash + bucket + within-bucket encoding scheme.

1.1 The two core purposes of bucketing + the MurmurHash32 scheme

Bucketing in NumericIndexedVector serves two purposes:

  1. Data sharding: different buckets can be distributed across different ClickHouse shards, so both construction and queries leverage distributed cluster parallelism.
  2. Variance estimation: in experiment / A-B testing platforms, metrics need variance estimates. If buckets are randomly and statistically independent, variance can be estimated directly on bucket aggregates without rescanning the raw data (see Covariance Estimation and its Application in Large-Scale Online Controlled Experiments).

The common scheme that satisfies both purposes is MurmurHash32 + high-bit contiguous-segment bucketing (for UInt32 keys):

  • MurmurHash32 is a bijection within the UInt32 space — business keys hash to a uniformly distributed value, and the original key is recoverable from the hash. This property simultaneously delivers: even sharding (purpose 1) and per-bucket randomness / independence (purpose 2).
  • Take the top 10 bits of the hash as the bucket id, giving 1024 buckets. The bucket id is used purely for sharding routing (which shard / which vector the data lands in).
  • Inside each vector, the full 32-bit hash value is stored (without truncating the high 10 bits). The in-bucket hash values are concentrated in a 2222^{22} (≈ 4M) contiguous hash segment — specifically [bucket_id × 2^{22}, (bucket_id+1) × 2^{22}). The number of in-bucket hash values is roughly 1/1024 of the total cardinality (tens to hundreds of thousands), still sparse across those 4M positions; §1.2’s Position Encoding does the per-bucket 0-based re-encoding to push each BSI bitmap close to optimal.
-- MurmurHash32 + top 10 bits as bucket id (UInt32 key)
SELECT
  dt,
  bitShiftRight(murmurHash3_32(key), 22) AS bucket_id,
  groupNumericIndexedVectorState(murmurHash3_32(key), val) AS v_state    -- in-vector key is the full hash
FROM source_table
GROUP BY dt, bucket_id;

Three anti-patterns, each failing at one of the three design layers (“bucketing / hashing / high-bit contiguous segments”):

  • Skip bucketing, dump all business keys into a single vector: all data lands in one vector, which can’t be distributed across shards — the cluster’s parallelism goes unused; a single vector also gets too large, slowing both construction and queries.
  • Bucket without hashing: bucketing directly on the high bits of the raw business key. Business-key high bits are typically uneven (user-ids cluster in certain ranges, order-ids increase by time, etc.), so buckets become unbalanced — sharding is uneven and variance estimation breaks down.
  • Hash, but bucket by mod or other random methods (instead of high-bit contiguous segments): keys within a bucket no longer cluster in a continuous hash range — they scatter across the entire 32-bit space. The underlying RoaringBitmap’s top-level containers each receive very few keys, almost all stored as Array containers; per-bucket storage cost rises and queries get slower.

1.2 Position Encoding: further within-bucket compactness

§1.1 shrunk each bucket’s hash range to a contiguous 2222^{22} ≈ 4M segment. But the hash values actually present in a bucket (tens to hundreds of thousands) are still sparsely distributed across those 4M positions — they don’t automatically land on a contiguous 0..N range.

Position Encoding does another round of re-encoding within each bucket: it maps the keys actually present in the bucket to consecutive positions 0, 1, 2, ... in order of appearance, and every BSI bitmap is built on this contiguous position space (paper §3.4).

Two engineering benefits:

  • In-bucket BSI bitmaps become extremely compact: each bitmap’s cardinality equals “the number of non-zero keys”, and they are 0-based contiguous integers, landing on RoaringBitmap’s Run / Array containers — storage cost is close to optimal. RoaringBitmap’s three container tiers:

    In-bucket position distributionContainerPer-bucket cost
    Many contiguous ranges (the post-encoding norm)Run: (start, length) pairsSmallest, tens of bytes per range
    Sparse (≤ 4096)Array: sorted array (2 bytes / element)8\leq 8 KB
    Dense random (> 4096)Bitmap: 8 KB raw bitmapFixed 8 KB

    Position Encoding concentrates per-bucket positions onto a few top-level containers (mostly the cheapest Run container), bringing per-bucket cost to tens of bytes to a few KB. Without Position Encoding, in-bucket keys scatter across the business key space, requiring many more top-level containers; total storage cost rises with key sparsity.

  • Sharing the encoding eliminates joins: when two vectors share the same Position Encoding, “same position means same key”. Subsequent pointwise operations therefore implicitly perform “match by key” without any explicit join or shuffle.

Position Encoding consistency is also an implicit prerequisite during the usage stage: the two vectors participating in a pointwise operation must come from the same encoding, otherwise “same position” maps to different keys. Cross-partition / cross-version merges must go through Merge to unify the encoding first.

1.3 Lift index to 64 bits

Business keys are often 64 bits — global snowflake ids, cross-tenant global user-ids, hash fingerprints. Supporting 64-bit keys uses the same hash + bucket + within-bucket Position Encoding scheme as §1.1, just with a 64-bit hash function and the more direct mod-based bucketing:

  • Hash + bucket: use murmurHash3_64 to scatter business keys into 64-bit hash values, then take the hash modulo 1024 as bucket_id, giving 1024 buckets (same magnitude as §1.1, sufficient for sharding while keeping per-bucket cardinality manageable);

  • Within-bucket Position Encoding: each bucket maintains an encoding dictionary + current max position counter, mapping each key seen to a 0-based contiguous position — already-seen keys return the same position, new keys get max + 1. This is the Position Encoding logic itself:

    # Each bucket_id keeps its own encoding dictionary + current max position
    encoding: Dict[bucket_id, Dict[key, UInt32]] = {}
    next_pos: Dict[bucket_id, UInt32] = {}
    
    def encode(bucket_id, key) -> UInt32:
        if key in encoding[bucket_id]:
            return encoding[bucket_id][key]   # already encoded, return existing position
        pos = next_pos[bucket_id]
        encoding[bucket_id][key] = pos        # new key, assign the next contiguous position
        next_pos[bucket_id] = pos + 1
        return pos

Current limitation: NumericIndexedVector in ClickHouse does not natively support 64-bit keys today (the key parameter type tops out at UInt32, see Post 1 §6). The “hash-based bucketing + per-bucket Position Encoding” pattern in this section is the common engineering composition that uses the existing 32-bit interface to handle 64-bit business keys.

Capacity: each bucket can hold up to 2322^{32} unique keys (Position Encoding uses UInt32 for the position); total system capacity is roughly 1024×2324.4×10121024 \times 2^{32} \approx 4.4 \times 10^{12}, sufficient for most current businesses.

-- Conceptual SQL: encode is the Position Encoding logic above (executed inside NumericIndexedVector)
SELECT
  dt,
  murmurHash3_64(key64) % 1024 AS bucket_id,
  groupNumericIndexedVectorState(
    encode(bucket_id, key64),    -- per-bucket dictionary lookup + auto-increment
    val) AS v_state
FROM raw_64bit_table
GROUP BY dt, bucket_id;

Data model:

CREATE TABLE vec_64 (
  dt Date,
  bucket_id UInt32,     -- murmurHash3_64(key64) % 1024
  v AggregateFunction(groupNumericIndexedVector, UInt32, UInt64),
  PRIMARY KEY (dt, bucket_id)
);

Three typical query operations:

Pointwise operations: inner-join the two tables on bucket_id, then perform pointwise operations within each matched bucket; when the buckets do not match on both sides, behavior follows the operator’s semantics — addition fills with 0, multiplication drops the bucket, subtraction keeps the left side.

SELECT numericIndexedVectorAllValueSum(
  numericIndexedVectorPointwiseMultiply(
    groupNumericIndexedVectorMerge(a.v_state),
    groupNumericIndexedVectorMerge(b.v_state)))
FROM vec_a a INNER JOIN vec_b b
  ON a.bucket_id = b.bucket_id;

Global aggregation: AllValueSum / Cardinality reduce per bucket, then perform a scalar reduction across buckets (sum-of-sums, sum-of-cardinalities).

Reconstruct as a Map: ToMap produces Map(UInt32, value) per bucket — the keys are the position encoding values passed at construction (i.e., the result of encode(bucket_id, key64)). To recover the original 64-bit key, the application must maintain a per-bucket reverse mapping (bucket_id, position) → key64; the reverse lookup must include bucket_id, since the same position in different buckets corresponds to different key64 values.

The specific bucket count can be tuned to the business scale (1024 here as an example).

When bucketing is designed correctly, the §4 benchmark numbers become reproducible — both storage and query payoff depend on the hash + bucket + Position Encoding combination being in place.

2. Applicable Scenarios

The engineering cost of migrating to NumericIndexedVector includes: an additional construction path, materialized-view refactoring, and downstream SQL rewrites. The next step is therefore deciding whether the workload at hand is worth migrating.

2.1 Scenarios suited for migration

Migration delivers significant payoff when all three conditions hold:

  1. UInt-keyed join, or GROUP BY key, with high key cardinality (tens of millions to a billion);
  2. The participating columns are integers or floats (or quantizable to integers); integers fit within UInt64 or the matching signed type, and floats use the default 40 + 24 fixed-point encoding (precision caveat in §2.2);
  3. The downstream operations are pointwise arithmetic / conditional filters / sum / count — operations that vectorize naturally.

2.2 Scenarios not suited for migration

  • Values are Decimal or strings. BSI is only defined over integers (including fixed-point reals) and cannot directly handle these value types. Decimal’s core promise is exact decimal arithmetic (e.g. 0.1 + 0.2 = 0.3 exactly), but BSI’s fixed-point uses base-2 (steps of 2f2^{-f}) and cannot represent all base-10 fractions exactly — forcing Decimal through it degrades it into a float and discards its very reason for existing. The recommended approach is to multiply by a constant upstream and store as integer (e.g. money in cents).
  • The workload requires GROUP BY on a non-key column. The vector’s dimension is the key, so changing the group dimension is equivalent to rebuilding the vector.
  • The workload requires window functions, rank, or order-dependent semantics. Position Encoding preserves neither time nor any other ordering.
  • Key cardinality is low (under a million) with simple query patterns. Row-wise typically delivers lower absolute latency, and the vector construction overhead becomes a net loss.

Note on float support: floats are supported — they take the fixed-point path described in the previous post (24 fractional bits by default, ~7 decimal digits of precision). If the workload has extreme precision requirements (beyond what 24 fractional bits can represent), multiplying by a constant upstream and storing as integer is recommended.

2.3 One-line summary

If the SQL has the form SELECT key, f(cols...) FROM ... GROUP BY key or A JOIN B ON A.key = B.key followed by a key-aware aggregation, with key cardinality 107\geq 10^7 and cols are real-valued (integer or float), the workload qualifies for migration.

3. Usage Guide

Once a workload qualifies, the usage guide has two parts: three common SQL rewrite patterns (§3.1–§3.3) and two construction approaches for the write path (§3.4–§3.5).

3.1 SQL rewrite: single-table pointwise arithmetic

Scenario: arithmetic between columns of a single table under the same key.

-- row-wise
SELECT key, val_a * val_b + val_c AS combined
FROM T;

Vector rewrite:

WITH
  va AS (SELECT groupNumericIndexedVector(key, val_a) FROM T),
  vb AS (SELECT groupNumericIndexedVector(key, val_b) FROM T),
  vc AS (SELECT groupNumericIndexedVector(key, val_c) FROM T)
SELECT numericIndexedVectorToMap(
  numericIndexedVectorPointwiseAdd(
    numericIndexedVectorPointwiseMultiply(va, vb),
    vc));

Sources of payoff:

  • Row-wise scan is replaced by bitwise RoaringBitmap operations. The high-bit bitmaps are typically empty, so the CPU’s actual processed bit count is far smaller than the row-wise byte total.
  • Use numericIndexedVectorToMap for per-key results; use numericIndexedVectorAllValueSum for global aggregation.
  • Single-table speedups typically land at a few times to roughly 10×, an order of magnitude smaller than the join case below.

3.2 SQL rewrite: two-table key-join + aggregation

Scenario: A JOIN B ON A.key = B.key followed by pointwise arithmetic + global or per-key aggregation. This pattern delivers the most significant payoff.

-- row-wise: hash join + row-wise sum
SELECT sum(A.val_a * B.val_b)
FROM A JOIN B ON A.key = B.key;

Vector rewrite:

WITH
  va AS (SELECT groupNumericIndexedVector(key, val_a) FROM A),
  vb AS (SELECT groupNumericIndexedVector(key, val_b) FROM B)
SELECT numericIndexedVectorAllValueSum(
  numericIndexedVectorPointwiseMultiply(va, vb));

va and vb share the same Position Encoding, so “same position means same key” — the hash-join semantics are absorbed by the encoding; no shuffle, no row-wise scan. Replace the outer AllValueSum with ToMap for per-key results.

The payoff is not linear — the larger the key cardinality and the more skewed the join, the more significant the gain. The paper reports the same workload’s pre-aggregation CPU consumption dropping to ~1/4, and ad-hoc mean latency falling from 22.3 s to 6.0 s (full table in §4).

3.3 SQL rewrite: conditional sum

Scenario: sum(if(cond, val, 0)) / sum(val) WHERE cond — the most common conditional aggregation.

-- row-wise
SELECT sum(if(val_a < 100, val_b, 0))
FROM T;

Vector rewrite:

WITH
  va AS (SELECT groupNumericIndexedVector(key, val_a) FROM T),
  vb AS (SELECT groupNumericIndexedVector(key, val_b) FROM T)
SELECT numericIndexedVectorAllValueSum(
  numericIndexedVectorPointwiseMultiply(
    numericIndexedVectorPointwiseLess(va, 100),
    vb));

Key points:

  • numericIndexedVectorPointwiseLess(va, 100) outputs a vector with values {0,1}\in \{0, 1\}the set of keys satisfying the predicate;
  • PointwiseMultiply uses that mask as a filter, multiplying it with vb;
  • AllValueSum produces the final result.

Compare → multiply-mask → aggregate is the most common three-step composition in vector SQL. The join rewrite in §3.2 is often nested with an additional conditional mask, following the same structure.

3.4 Construction timing: ad-hoc vs pre-aggregation

Once the SQL form is determined, the remaining question is when these vectors are constructed. Ad-hoc construction (running groupNumericIndexedVector at query time) and pre-aggregation (using a materialized view to persist vectors into an AggregateFunction column at write time) are two distinct engineering approaches.

Ad-hoc construction — read the table directly at query time with SELECT groupNumericIndexedVector(key, val) FROM T:

  • Suited for: manageable key cardinality, bounded scanned partition count, flexible query dimensions (the same table is queried with different group dimensions).
  • Cost: every query must scan the column and construct the vector; the construction path itself is non-trivial.

Pre-aggregation — create a target table with an AggregateFunction(groupNumericIndexedVector, ...) column, and use a materialized view to persist the vector at write time:

CREATE MATERIALIZED VIEW mv_vectors TO target_table AS
SELECT
  dt,
  groupNumericIndexedVectorState(key, val_a) AS va_state,
  groupNumericIndexedVectorState(key, val_b) AS vb_state
FROM source_table
GROUP BY dt;

At query time, use groupNumericIndexedVectorMerge to combine states into a vector and pass to pointwise / aggregation operators:

SELECT numericIndexedVectorAllValueSum(
  numericIndexedVectorPointwiseMultiply(
    groupNumericIndexedVectorMerge(va_state),
    groupNumericIndexedVectorMerge(vb_state)))
FROM target_table
WHERE dt BETWEEN '2026-01-01' AND '2026-01-29';
  • Suited for: high-frequency workloads where the same vector table is queried repeatedly — A/B-experiment metric platforms, user-profile lookups, order / retention dashboards.
  • Cost: the write path becomes longer, and storage gains an extra AggregateFunction column. The headline numbers in §4 come primarily from this path.

3.5 Notes

DimensionConclusionReason
Bit-width selectionRound up conservatively to the value range (use UInt32 when UInt24 suffices)Unused high-bit layers correspond to empty bitmaps with near-zero storage and compute cost; overflow forces truncation or an exception
Signed vs unsignedForce signed when negative values or differences may occurTwo’s-complement extension is fixed at construction time; converting later requires rebuilding all bitmaps
Position-Encoding consistencyTwo vectors used in pointwise ops must come from the same encoding”Same position” maps to different keys under different encodings; cross-partition / cross-version merges must go through Merge first (§1.2)
Operations not applicableDo not apply window functions / sort / rank to vectorsPosition Encoding preserves neither time nor any ordering

4. Measured Payoff and Comparisons

All numbers below are from the Xiong & Wang paper’s measurements in WeChat scenarios. Folding the four key tables into one:

DimensionRow-wiseNumericIndexedVectorGain
Total storage4.1 TB1.6 TB~40% (60% reduction)
Single global sum time59.2 s0.6 s~100×
Pre-aggregation CPU hoursxx / 4~4×
Ad-hoc query mean latency22.3 s6.0 s~3.7×

Each of the four corresponds to a different source of savings, explained below.

4.1 Storage reduced by 60%: BSI is a compression format

A row-wise (key, value) column stores a full tuple for each pair (e.g. ~12 bytes for UInt32 + UInt64), with storage cost growing linearly with the number of unique keys. The vector form splits this into two layers: Position Encoding maps keys to compact 0..N-1 positions; BSI then slices the value across multiple bitmaps, each backed by RoaringBitmap for compact storage — post-encoding contiguous positions land on Run containers (nearly free), and bitmaps for the high bits of the value are typically all-empty (zero cost). Sparse columns are naturally compact under BSI — the uncompressed vector size is comparable to the row-wise column compressed with LZ4 / ZSTD. BSI completes compression at the encoding stage, so applying generic compression on top yields limited additional gain.

4.2 Single sum sped up ~100×: SIMD + sparse bit-width

numericIndexedVectorAllValueSum operates directly on the compressed representation, performing POPCOUNT × bit-weight per bit. The high-bit bitmaps are typically empty, so the CPU’s actual processed data drops to single-digit percent of row-wise; combined with SIMD, this yields the 100× speedup. The optimization ceiling for row-wise is “no decompress, no deserialize,” but the column’s total bytes are a fixed cost that cannot be eliminated.

4.3 Pre-aggregation CPU reduced to ~1/4: the join is absorbed by the encoding

This payoff comes from the join rewrite in §3.2. Hash join + shuffle cost grows linearly with key cardinality; switching to PointwiseMultiply + AllValueSum means key is no longer a matched dimension — it has been encoded into the position. Both CPU and network bandwidth drop, and the shuffle phase is eliminated entirely.

4.4 Ad-hoc mean latency reduced ~3.7×: three savings combined

Ad-hoc query latency reduction comes from three combined savings: smaller storage → less I/O; vectorized compute → less CPU; join elimination → no shuffle jitter. The 22.3 s → 6.0 s mean obscures the tail-latency story — for join-heavy queries, the vector approach’s p99 improvement exceeds its mean improvement.

4.5 Side-by-side comparisons and limits

  • vs groupBitmap: groupBitmap only answers “is this key present?”, carrying no value; NumericIndexedVector is its strict superset. But in valueless profiling / audience-selection scenarios, groupBitmap remains the right choice.
  • vs row-wise (key, value) + hash join: row-wise is more flexible, but its cost grows linearly with key cardinality. For low-cardinality / simple queries, row-wise typically delivers lower absolute latency.
  • vs dense array / fixed-length numeric vector: small key spaces (under a million contiguous ids) are faster as dense arrays; sparse / large key spaces (tens of millions, non-contiguous) require BSI to fit storage.
  • Scenarios not applicable: high-cardinality non-key column groupings, Decimal / string value domains, window / temporal semantics. Consistent with §2.2.

Wrapping Up

The optimal use case for NumericIndexedVector is “high-cardinality UInt key + real-valued columns (integer or float) + pointwise / sum / count downstream”. In that scenario, the paper’s measured payoff is storage -60%, single-shot sum 100×, pre-aggregation CPU 4×, ad-hoc latency 3.7×. This post is organized as “bucketing → applicable scenarios → usage guide” precisely because the first two engineering designs (bucketing and Position Encoding) determine whether the above payoff can be realized in full.

Reading the previous post’s “algorithm layer” alongside this post’s “engineering layer”, several design trade-offs run through the entire series and tend to be overlooked:

  • Compression happens at the encoding stage: BSI’s sparsity across multiple bitmaps itself carries the compression role, distinct from row-wise columns relying on LZ4 / ZSTD. This explains why generic compression on top of BSI yields limited additional gain.
  • The essence of join elimination is the key entering position: Position Encoding encodes “match by key” into the position, so pointwise operations are equivalent to implicit joins. Whether this payoff is achievable depends on both vectors sharing the same Position Encoding — this is why “Position-Encoding consistency” is listed in the §3.5 notes table.
  • Bucketing + Position Encoding is the engineering boundary: per-bucket vector storage cost is highly sensitive to the post-Position-Encoding position distribution; the gap between best and worst tiers can be an order of magnitude. Business keys must go through hash + bucketing before construction; never dump them into a single vector.
  • 64-bit key support isn’t a separate extension — it’s the same scheme with a different hash function and bucketing form: replace murmurHash3_32 + high-bit-segment bucketing (§1.1) with murmurHash3_64 + mod bucketing (§1.3); within-bucket Position Encoding is unchanged.