pd_cipher::encryptors::vector_lsh
Vector token encryption via keyed locality-sensitive hashing (Secure Vector Projection).
This module is the third pillar of PD.Cipher, alongside the existing text and numeric token transforms. Vectors are encrypted to deterministic byte tokens that preserve cosine similarity, the structural property every vector-consuming ML workload (RAG, k-NN, retrieval) needs.
Pattern parity with the existing pillars
| Pillar | Preserves | Hides | Mechanism |
|---|---|---|---|
| Text | equality | plaintext | Blake2bMac512 keyed PRF, truncated |
| Numeric | order | magnitude | keyed monotonic piecewise-linear |
| Vector | cosine similarity | coordinates, norm, dimension d | keyed PRF → hyperplane signs (SimHash) |
The cryptographic assumption is the same one the existing tokens already
rely on: that Blake2bMac512 under a secret key is a PRF. The threat model
is the same: passive adversary, ciphertext-only, no plaintext-ciphertext
pairs available.
Algorithm
For each output bit i ∈ [0, L):
- Derive a 32-byte seed from (key, projection_domain, bit_length, d, i) via Blake2b-MAC.
- Seed a ChaCha20Rng; sample d standard-normal floats g_1, ..., g_d.
- Compute dot = Σ x_j · g_j and emit bit i = (dot >= 0).
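The per-bit loop above can be sketched in a few lines. This is an illustrative stand-in, not the module's implementation: to stay dependency-free it replaces both the Blake2b-MAC seed derivation and ChaCha20Rng with a non-cryptographic SplitMix64 generator plus Box-Muller sampling, uses a bare u64 as a toy key, omits projection_domain, and assumes bit_length is a multiple of 8. The names splitmix64, std_normal, and sketch_token are hypothetical.

```rust
/// NOT cryptographic: SplitMix64 stands in for the keyed PRF and the
/// ChaCha20 stream so that the sketch needs no external crates.
fn splitmix64(state: &mut u64) -> u64 {
    *state = state.wrapping_add(0x9E37_79B9_7F4A_7C15);
    let mut z = *state;
    z = (z ^ (z >> 30)).wrapping_mul(0xBF58_476D_1CE4_E5B9);
    z = (z ^ (z >> 27)).wrapping_mul(0x94D0_49BB_1331_11EB);
    z ^ (z >> 31)
}

/// One standard-normal sample via the Box-Muller transform.
fn std_normal(state: &mut u64) -> f64 {
    // u1 lies in (0, 1] so ln(u1) is finite; u2 lies in [0, 1).
    let u1 = ((splitmix64(state) >> 11) as f64 + 1.0) / (1u64 << 53) as f64;
    let u2 = (splitmix64(state) >> 11) as f64 / (1u64 << 53) as f64;
    (-2.0 * u1.ln()).sqrt() * (2.0 * std::f64::consts::PI * u2).cos()
}

/// Hypothetical helper: one hyperplane sign per output bit, packed LSB-first.
fn sketch_token(x: &[f64], toy_key: u64, bit_length: usize) -> Vec<u8> {
    let mut bytes = vec![0u8; bit_length / 8];
    for i in 0..bit_length {
        // Per-bit seed: toy substitute for PRF(key, projection_domain, bit_length, d, i).
        let mut state = toy_key
            ^ (i as u64).wrapping_mul(0x9E37_79B9_7F4A_7C15)
            ^ (x.len() as u64).rotate_left(32)
            ^ (bit_length as u64).rotate_left(48);
        // Draw g_1..g_d in order and accumulate the dot product with x.
        let dot: f64 = x.iter().map(|xj| xj * std_normal(&mut state)).sum();
        if dot >= 0.0 {
            bytes[i / 8] |= 1u8 << (i % 8);
        }
    }
    bytes
}
```

Because only the sign of each dot product is kept, scaling x by any positive constant leaves the token unchanged, which is why the norm does not survive encryption.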
For two vectors a, b with angle θ, the probability that a single random
hyperplane separates their signs is θ/π (Charikar 2002, building on
Goemans-Williamson 1995). The Hamming distance between two L-bit tokens
therefore estimates θ as π · H(a,b) / L, with standard error O(1/√L).
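The O(1/√L) rate can be made concrete under one assumption: if the L hyperplanes are treated as independent, H(a,b) is Binomial(L, θ/π), so the angle estimate π · H / L has standard error π · √(p(1−p)/L) with p = θ/π. The sketch below evaluates that expression; the function names are hypothetical and the independence assumption is stated, not taken from the module.

```rust
use std::f64::consts::PI;

/// Standard error (in radians) of the angle estimate π·H/L, assuming the L
/// sign bits are independent so that H ~ Binomial(L, θ/π).
fn angle_std_error(theta: f64, bit_length: usize) -> f64 {
    let p = theta / PI; // per-bit probability that the two signs differ
    PI * (p * (1.0 - p) / bit_length as f64).sqrt()
}

/// Worst case over θ, attained at θ = π/2: π / (2·√L).
fn worst_case_angle_std_error(bit_length: usize) -> f64 {
    PI / (2.0 * (bit_length as f64).sqrt())
}
```

Quadrupling bit_length halves the error, so the token length trades storage directly against estimation accuracy.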
Compactness
Output is bit_length / 8 bytes regardless of input dimension. A 1536-dim
f32 embedding (6144 bytes) becomes a 32-byte token at the default
L = 256: a 192× reduction.
Structs
VectorToken
A deterministic keyed token representing an encrypted vector.
Vector tokens preserve cosine similarity in the same way numeric tokens preserve order and text tokens preserve equality: the structural property the workload needs survives encryption; the underlying values do not.
Properties
- Deterministic: identical (vector, key, projection_domain, bit_length) always produce identical tokens.
- Compact: bit_length / 8 bytes, independent of vector dimension.
- Comparable without the key: [cosine_similarity] estimates the angle between the two original vectors directly from the tokens.
Wire format
bytes[i / 8] stores bit i at position 1u8 << (i % 8) (LSB-first within
each byte). Hamming distance is XOR + popcount; bit order within a byte is
irrelevant for comparison.
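A minimal sketch of this layout over plain byte slices follows. The helpers pack_bits, get_bit, and hamming are hypothetical names for illustration and are not part of the VectorToken API.

```rust
/// Pack sign bits LSB-first: bit i lands in bytes[i / 8] under mask 1 << (i % 8).
fn pack_bits(bits: &[bool]) -> Vec<u8> {
    let mut bytes = vec![0u8; (bits.len() + 7) / 8];
    for (i, &b) in bits.iter().enumerate() {
        if b {
            bytes[i / 8] |= 1u8 << (i % 8);
        }
    }
    bytes
}

/// Read bit i back out of the packed representation.
fn get_bit(bytes: &[u8], i: usize) -> bool {
    (bytes[i / 8] & (1u8 << (i % 8))) != 0
}

/// Hamming distance as XOR + popcount, byte by byte.
/// Assumes both slices have the same length.
fn hamming(a: &[u8], b: &[u8]) -> u32 {
    a.iter().zip(b).map(|(x, y)| (x ^ y).count_ones()).sum()
}
```

Since popcount counts set bits wherever they sit, the distance is the same for any within-byte bit order, provided both tokens were packed the same way.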
pub struct VectorToken
Functions
encrypt_vector_to_token
Encrypts a real-valued vector into a deterministic similarity-preserving token.
Same (vector, key, projection_domain, bit_length) always produces the
same VectorToken. The token reveals cosine similarity to other tokens
from the same context and nothing more.
Errors
- ValidationError if bit_length is invalid (see [validate_token_bits]), the vector is empty, or any coordinate is NaN / infinite.
- InvalidKey if the key bytes are unsuitable for the Blake2b-MAC PRF.
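The vector-side conditions can be pictured as a simple pre-check. The sketch below is hypothetical: check_vector is not a function of this module, a String stands in for ValidationError, and the bit_length and key checks are left out because their rules are not spelled out here.

```rust
/// Hypothetical pre-check mirroring the documented input rules: the vector
/// must be non-empty and every coordinate must be finite (no NaN, no ±inf).
fn check_vector(v: &[f32]) -> Result<(), String> {
    if v.is_empty() {
        return Err("vector is empty".to_string());
    }
    match v.iter().position(|x| !x.is_finite()) {
        Some(j) => Err(format!("coordinate {} is not finite", j)),
        None => Ok(()),
    }
}
```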
pub fn encrypt_vector_to_token(...)
cosine_similarity
Estimates cosine similarity between the underlying vectors of two tokens, using only the tokens — no key, no original vectors.
This is the vector-tier analog of comparing two numeric tokens for order: the structural property is encoded directly in the ciphertext.
Estimator
cos(θ) ≈ cos(π · H(a, b) / L), where H is the Hamming distance between
the two L-bit tokens. Standard error scales as O(1/√L).
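As a rough illustration, the estimator can be written directly over raw token bytes. This sketch assumes both tokens are plain byte slices of equal length; estimate_cosine is a hypothetical name, and the encryption-context check that the real function performs cannot be reproduced from bytes alone.

```rust
use std::f64::consts::PI;

/// Hypothetical stand-alone estimator: cos(π · H / L) from two packed tokens.
/// Returns None when the lengths differ or the tokens are empty.
fn estimate_cosine(a: &[u8], b: &[u8]) -> Option<f64> {
    if a.len() != b.len() || a.is_empty() {
        return None;
    }
    let l = (a.len() * 8) as f64; // token length in bits
    // Hamming distance: XOR each byte pair and count the set bits.
    let h: u32 = a.iter().zip(b).map(|(x, y)| (x ^ y).count_ones()).sum();
    Some((PI * h as f64 / l).cos())
}
```

Identical tokens give an estimate of 1, tokens that differ in half their bits give roughly 0, and tokens that differ in every bit give -1.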
Errors
Returns ValidationError if the tokens have different bit lengths or
originate from different encryption contexts.
pub fn cosine_similarity(...)
validate_token_bits
Validates a candidate vector_token_bits setting.
pub fn validate_token_bits(...)
rank_by_similarity
Returns the top-k corpus indices ranked by estimated cosine similarity
to the query.
Indices come back in descending order of similarity. Ties are broken deterministically by original index, ascending.
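One way to picture the ranking, sketched over raw token bytes: because cos(π · H / L) decreases as H grows, sorting by Hamming distance ascending gives the same order as sorting by estimated cosine descending. The function rank_sketch is hypothetical; it assumes all tokens share one length and one encryption context, and it does no error handling.

```rust
/// Hypothetical top-k ranking over packed tokens of equal length.
fn rank_sketch(query: &[u8], corpus: &[Vec<u8>], k: usize) -> Vec<usize> {
    let mut scored: Vec<(u32, usize)> = corpus
        .iter()
        .enumerate()
        .map(|(idx, tok)| {
            // Hamming distance between the query and this corpus token.
            let h: u32 = query.iter().zip(tok).map(|(x, y)| (x ^ y).count_ones()).sum();
            (h, idx)
        })
        .collect();
    // Tuple ordering sorts by distance first, then by original index,
    // which yields the ascending-index tie-break.
    scored.sort_unstable();
    scored.into_iter().take(k).map(|(_, idx)| idx).collect()
}
```

A k larger than the corpus simply returns every index in ranked order.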
Errors
Propagates any [cosine_similarity] error encountered while scoring the
corpus. The query and every corpus token must come from the same encryption
context.
pub fn rank_by_similarity(...)
Constants
MAX_VECTOR_TOKEN_BITS
Maximum supported vector token length (8 KiB per token).
pub const MAX_VECTOR_TOKEN_BITS: ...
MIN_VECTOR_TOKEN_BITS
Minimum supported vector token length.
pub const MIN_VECTOR_TOKEN_BITS: ...
DEFAULT_VECTOR_TOKEN_BITS
Default vector token length in bits (256 bits = 32 bytes).
Empirically gives a standard error of roughly ±0.04 in the estimated cosine.
pub const DEFAULT_VECTOR_TOKEN_BITS: ...