pd_cipher::encryptors::vector_lsh

Vector token encryption via keyed locality-sensitive hashing (LSH), also described as secure vector projection.

This module is the third pillar of PD.Cipher, alongside the existing text and numeric token transforms. Vectors are encrypted to deterministic byte tokens that preserve cosine similarity, the structural property every vector-consuming ML workload (RAG, k-NN, retrieval) needs.

Pattern parity with the existing pillars

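| Pillar  | Preserves         | Hides           | Mechanism                                 |
|---------|-------------------|-----------------|-------------------------------------------|
| Text    | equality          | plaintext       | Blake2bMac512 keyed PRF, truncated        |
| Numeric | order             | magnitude       | keyed monotonic piecewise-linear          |
| Vector  | cosine similarity | coords, norm, d | keyed PRF → hyperplane signs (SimHash)    |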

The cryptographic assumption is the same one the existing tokens already rely on: that Blake2bMac512 under a secret key is a PRF. The threat model is the same: passive adversary, ciphertext-only, no plaintext-ciphertext pairs available.

Algorithm

For each output bit i ∈ [0, L):

  1. Derive a 32-byte seed from (key, projection_domain, bit_length, d, i) via Blake2b-MAC.
  2. Seed a ChaCha20Rng; sample d standard-normal floats g_1, ..., g_d.
  3. Compute dot = Σ x_j · g_j and emit bit i = (dot >= 0).
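
The three steps above can be sketched in dependency-free Rust. This is an illustration under stated assumptions, not the module's implementation: `derive_seed` is a hypothetical stand-in that mixes its inputs with SplitMix64 (not a PRF, not secure) where the module uses Blake2b-MAC, and a SplitMix64 stream with Box-Muller stands in for ChaCha20Rng with a normal sampler. The seed is a `u64` here rather than 32 bytes, and all function names and signatures are invented for the example.

```rust
use std::f64::consts::PI;

/// SplitMix64 step. A stand-in generator only: NOT a cryptographic PRF or CSPRNG.
fn splitmix64(state: &mut u64) -> u64 {
    *state = (*state).wrapping_add(0x9E37_79B9_7F4A_7C15);
    let mut z = *state;
    z = (z ^ (z >> 30)).wrapping_mul(0xBF58_476D_1CE4_E5B9);
    z = (z ^ (z >> 27)).wrapping_mul(0x94D0_49BB_1331_11EB);
    z ^ (z >> 31)
}

/// Mixes one 64-bit word into a running hash state.
fn absorb(acc: u64, word: u64) -> u64 {
    let mut s = acc ^ word;
    splitmix64(&mut s)
}

/// Step 1 (stand-in for Blake2b-MAC): seed from (key, domain, bit_length, d, i).
fn derive_seed(key: &[u8], domain: &[u8], bit_length: usize, d: usize, i: usize) -> u64 {
    let mut acc = absorb(0, key.len() as u64);
    for &b in key {
        acc = absorb(acc, b as u64);
    }
    acc = absorb(acc, domain.len() as u64);
    for &b in domain {
        acc = absorb(acc, b as u64);
    }
    acc = absorb(acc, bit_length as u64);
    acc = absorb(acc, d as u64);
    absorb(acc, i as u64)
}

/// Step 2 (stand-in for ChaCha20Rng plus a normal sampler): one Box-Muller draw.
fn standard_normal(state: &mut u64) -> f64 {
    let scale = (1u64 << 53) as f64;
    let u1 = ((splitmix64(state) >> 11) as f64 + 1.0) / scale; // in (0, 1]
    let u2 = (splitmix64(state) >> 11) as f64 / scale; // in [0, 1)
    (-2.0 * u1.ln()).sqrt() * (2.0 * PI * u2).cos()
}

/// Steps 1 to 3 for every output bit, packed LSB-first within each byte.
fn simhash_token(x: &[f32], key: &[u8], domain: &[u8], bit_length: usize) -> Vec<u8> {
    // Assumption of this sketch: the token is a whole number of bytes.
    assert!(bit_length % 8 == 0, "sketch assumes a whole number of bytes");
    let mut bytes = vec![0u8; bit_length / 8];
    for i in 0..bit_length {
        let mut rng = derive_seed(key, domain, bit_length, x.len(), i);
        // Step 3: project x onto the i-th pseudo-random hyperplane normal.
        let dot: f64 = x.iter().map(|&xj| xj as f64 * standard_normal(&mut rng)).sum();
        if dot >= 0.0 {
            bytes[i / 8] |= 1u8 << (i % 8);
        }
    }
    bytes
}
```

Because only the sign of each projection is kept, scaling the input by a positive constant leaves the token unchanged, which is one way to see why the norm is hidden.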

For two vectors a, b with angle θ, the probability that a single random hyperplane separates their signs is θ/π (Charikar 2002, building on Goemans-Williamson 1995). The Hamming distance between two L-bit tokens therefore estimates θ as π · H(a,b) / L, with standard error O(1/√L).
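
The estimator and its error can be written out as a short derivation. It uses standard binomial reasoning and treats the L bits as independent, which is the idealized case of independently drawn hyperplanes; p denotes the per-bit disagreement probability.

```latex
\begin{aligned}
p &= \Pr\big[\mathrm{bit}_i(a) \neq \mathrm{bit}_i(b)\big] = \frac{\theta}{\pi}, \\
H(a,b) &\sim \mathrm{Binomial}(L,\, p), \qquad \mathbb{E}[H(a,b)] = \frac{L\,\theta}{\pi}, \\
\hat{\theta} &= \frac{\pi\, H(a,b)}{L}, \qquad
\operatorname{SE}(\hat{\theta}) = \pi \sqrt{\frac{p\,(1-p)}{L}} \;\le\; \frac{\pi}{2\sqrt{L}}.
\end{aligned}
```

The bound is attained at θ = π/2 (p = 1/2), so the angle estimate is noisiest for near-orthogonal vectors.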

Compactness

Output is bit_length / 8 bytes regardless of input dimension. A 1536-dim f32 embedding (6144 bytes) becomes a 32-byte token at the default L = 256: a 192× reduction.
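
The arithmetic behind the 192× figure can be checked with a few lines of Rust. The helper names below are hypothetical and exist only for this example.

```rust
/// Token size in bytes for a given bit length (assumes a multiple of 8).
fn token_len_bytes(bit_length: usize) -> usize {
    bit_length / 8
}

/// Raw size in bytes of a d-dimensional f32 vector.
fn raw_len_bytes(d: usize) -> usize {
    d * std::mem::size_of::<f32>()
}

/// Size ratio between the raw embedding and its token.
fn reduction_factor(d: usize, bit_length: usize) -> usize {
    raw_len_bytes(d) / token_len_bytes(bit_length)
}
```

For d = 1536 and L = 256 this gives 6144 / 32 = 192.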

Structs

VectorToken

A deterministic keyed token representing an encrypted vector.

Vector tokens preserve cosine similarity in the same way numeric tokens preserve order and text tokens preserve equality: the structural property the workload needs survives encryption; the underlying values do not.

Properties

  • Deterministic: identical (vector, key, projection_domain, bit_length) always produce identical tokens.
  • Compact: bit_length / 8 bytes, independent of vector dimension.
  • Comparable without the key: [cosine_similarity] estimates the angle between the two original vectors directly from the tokens.

Wire format

bytes[i / 8] stores bit i at position 1u8 << (i % 8) (LSB-first within each byte). Hamming distance is XOR + popcount; bit order within a byte is irrelevant for comparison.
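
A minimal sketch of this layout, operating on raw byte slices rather than on `VectorToken` (whose fields are not shown here). The function names are assumptions made for the example.

```rust
/// Sets bit `i` in an LSB-first packed buffer.
fn set_bit(bytes: &mut [u8], i: usize) {
    bytes[i / 8] |= 1u8 << (i % 8);
}

/// Reads bit `i` from an LSB-first packed buffer.
fn get_bit(bytes: &[u8], i: usize) -> bool {
    (bytes[i / 8] & (1u8 << (i % 8))) != 0
}

/// Hamming distance via XOR + popcount; `None` if the lengths differ.
fn hamming_distance(a: &[u8], b: &[u8]) -> Option<u32> {
    if a.len() != b.len() {
        return None;
    }
    Some(a.iter().zip(b).map(|(x, y)| (x ^ y).count_ones()).sum())
}
```

Since popcount counts set bits regardless of position, an MSB-first layout would yield the same distances as long as both tokens use the same convention.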

pub struct VectorToken

Functions

encrypt_vector_to_token

Encrypts a real-valued vector into a deterministic similarity-preserving token.

Same (vector, key, projection_domain, bit_length) always produces the same VectorToken. The token reveals cosine similarity to other tokens from the same context and nothing more.

Errors

  • ValidationError if bit_length is invalid (see [validate_token_bits]), the vector is empty, or any coordinate is NaN / infinite.
  • InvalidKey if the key bytes are unsuitable for the Blake2b-MAC PRF.

pub fn encrypt_vector_to_token(...)
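
The input checks listed under Errors (empty vector, NaN or infinite coordinates) might look like the sketch below. The `InputError` enum and `check_vector` function are hypothetical; the module's own ValidationError type and its exact checks are not shown on this page.

```rust
/// Hypothetical error type for this sketch only.
#[derive(Debug, PartialEq)]
enum InputError {
    EmptyVector,
    /// Index of the first NaN or infinite coordinate.
    NonFiniteCoordinate(usize),
}

/// Rejects empty vectors and vectors containing NaN or infinite values.
fn check_vector(x: &[f32]) -> Result<(), InputError> {
    if x.is_empty() {
        return Err(InputError::EmptyVector);
    }
    match x.iter().position(|v| !v.is_finite()) {
        Some(j) => Err(InputError::NonFiniteCoordinate(j)),
        None => Ok(()),
    }
}
```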

cosine_similarity

Estimates cosine similarity between the underlying vectors of two tokens using only the tokens: neither the key nor the original vectors are needed.

This is the vector-tier analog of comparing two numeric tokens for order: the structural property is encoded directly in the ciphertext.

Estimator

cos(θ) ≈ cos(π · H(a, b) / L), where H is the Hamming distance between the two L-bit tokens. Standard error scales as O(1/√L).
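
As a sketch, the estimator reduces to a Hamming distance followed by one cosine. The function below is hypothetical and works on raw token bytes, so it can check lengths but not the encryption context that the real function also validates.

```rust
/// Estimates cos(theta) as cos(pi * H / L) from two equal-length tokens.
/// Returns `None` for empty or mismatched inputs.
fn estimate_cosine(a: &[u8], b: &[u8]) -> Option<f64> {
    if a.is_empty() || a.len() != b.len() {
        return None;
    }
    let l = (a.len() * 8) as f64; // token length in bits
    let h: u32 = a.iter().zip(b).map(|(x, y)| (x ^ y).count_ones()).sum();
    Some((std::f64::consts::PI * h as f64 / l).cos())
}
```

Identical tokens give an estimate of 1, fully complementary tokens give -1, and tokens differing in half their bits give approximately 0.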

Errors

Returns ValidationError if the tokens have different bit lengths or originate from different encryption contexts.

pub fn cosine_similarity(...)

validate_token_bits

Validates a candidate vector_token_bits setting.

pub fn validate_token_bits(...)

rank_by_similarity

Returns the top-k corpus indices ranked by estimated cosine similarity to the query.

Indices come back in descending order of similarity. Ties are broken deterministically by original index, ascending.
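
A minimal sketch of this ordering on raw token bytes. Because cos(π · H / L) decreases as H grows, sorting by Hamming distance ascending is equivalent to sorting by estimated similarity descending. The name `rank_top_k` and its signature are assumptions for the example, and a length mismatch stands in for the real context check.

```rust
/// Returns up to `k` corpus indices, most similar first.
/// Ties on distance are resolved by the lower original index.
fn rank_top_k(query: &[u8], corpus: &[Vec<u8>], k: usize) -> Option<Vec<usize>> {
    let mut scored: Vec<(u32, usize)> = Vec::with_capacity(corpus.len());
    for (idx, tok) in corpus.iter().enumerate() {
        if tok.len() != query.len() {
            return None; // mismatched token lengths
        }
        let h: u32 = query.iter().zip(tok).map(|(x, y)| (x ^ y).count_ones()).sum();
        scored.push((h, idx));
    }
    // Tuple ordering: distance ascending, then index ascending.
    scored.sort();
    Some(scored.into_iter().take(k).map(|(_, idx)| idx).collect())
}
```

A full sort costs O(n log n); for large corpora and small k, a bounded heap would be a reasonable alternative.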

Errors

Propagates any [cosine_similarity] error encountered while scoring the corpus. The query and every corpus token must come from the same encryption context.

pub fn rank_by_similarity(...)

Constants

MAX_VECTOR_TOKEN_BITS

Maximum supported vector token length (8 KiB per token).

pub const MAX_VECTOR_TOKEN_BITS: ...

MIN_VECTOR_TOKEN_BITS

Minimum supported vector token length.

pub const MIN_VECTOR_TOKEN_BITS: ...

DEFAULT_VECTOR_TOKEN_BITS

Default vector token length in bits (256 bits = 32 bytes).

Empirically gives ~±0.04 cosine standard error on the recovery estimator.

pub const DEFAULT_VECTOR_TOKEN_BITS: ...