v1 Binary Format

The stable compatibility surface. Blocks, types, varints, checksums.

1. Overview

Surp is a compact, canonical binary serialization format designed as an alternative to JSON. It provides deterministic encoding, schema evolution support, and both human-readable and binary representations.

Design principles:

  1. Safety and correctness over micro-optimizations
  2. Deterministic encoding (same data → same bytes, always)
  3. Zero-copy decoding when possible
  4. Forward/backward compatible schema evolution
  5. Streaming and random-access support

2. File Layout

Figure (Spec §2): File layout. Blocks are written sequentially, each block is self-contained, and the file is verified by the trailer.

Block  Type        ID    Contents
00     StringDict  0x04  prefix-delta entries
01     Data        0x01  block 0, wire-encoded values
02     Data        0x01  block 1, wire-encoded values
03     Data        0x01  block N, wire-encoded values
04     Index       0x02  optional; offsets
05     Trailer     0xFF  XXH64 over file

2.1 Block Header

Figure (Spec §2.1): Block header anatomy. Fields appear in the order listed below, and fixed-width fields are little-endian. The checksum is computed over the uncompressed payload, so integrity is verified after decompression.

Field       Size     Description
block_type  1 byte   Block type ID (see §2.2)
block_len   varint   Length of payload in bytes
comp_type   1 byte   Compression algorithm ID
checksum    8 bytes  XXH64 of uncompressed payload (LE)
payload     N bytes  Block data (N = block_len)
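
The header fields above can be read with a few lines of code. The following is a minimal sketch, not the surp-core implementation: `BlockHeader`, `read_varint`, and `parse_block_header` are hypothetical names, and the varint reader is simplified (it does not reject overlong encodings).

```rust
/// Block header fields in wire order. Names follow the field table above;
/// the struct itself is a hypothetical illustration, not a surp-core type.
#[derive(Debug, PartialEq)]
pub struct BlockHeader {
    pub block_type: u8,
    pub block_len: u64,
    pub comp_type: u8,
    pub checksum: u64,
}

/// Reads an unsigned LEB128 varint; returns (value, bytes consumed).
/// Simplified: caps the length at 10 bytes but accepts overlong forms.
fn read_varint(buf: &[u8]) -> Option<(u64, usize)> {
    let mut value = 0u64;
    for (i, &byte) in buf.iter().enumerate().take(10) {
        value |= u64::from(byte & 0x7F) << (7 * i);
        if byte & 0x80 == 0 {
            return Some((value, i + 1));
        }
    }
    None
}

/// Parses one header from the start of `buf`.
/// Returns the header and the offset at which the payload begins.
pub fn parse_block_header(buf: &[u8]) -> Option<(BlockHeader, usize)> {
    let block_type = *buf.first()?;
    let (block_len, varint_len) = read_varint(buf.get(1..)?)?;
    let mut pos = 1 + varint_len;
    let comp_type = *buf.get(pos)?;
    pos += 1;
    // The checksum is stored as 8 little-endian bytes.
    let mut checksum = [0u8; 8];
    checksum.copy_from_slice(buf.get(pos..pos + 8)?);
    pos += 8;
    let header = BlockHeader {
        block_type,
        block_len,
        comp_type,
        checksum: u64::from_le_bytes(checksum),
    };
    Some((header, pos))
}
```

A caller would then take `block_len` bytes starting at the returned offset as the payload and verify the checksum against it (after decompression, if `comp_type` is not 0x00).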

2.2 Block Types

ID    Name        Description
0x01  Data        Encoded value data
0x02  Index       Offset index for random access
0x03  Schema      Embedded schema information
0x04  StringDict  String dictionary for deduplication
0xFF  Trailer     File-level checksum (last block)

2.4 Compression Types

ID    Name    Description
0x00  None    No compression
0x01  Zstd    Zstandard compression
0x02  Snappy  Snappy compression
0x03  LZ4     LZ4 frame compression

Adaptive compression (feature lz4/zstd/snappy): the encoder can sample the first N bytes of a block and select the algorithm that achieves the best ratio above a configurable threshold. See AdaptiveSelector in surp-compression.

2.5 Compressed Block Wire Format

When a block is compressed, the on-wire payload is:

uncompressed_len(varint) | compressed_data
  • The block_len field in the block header reflects the compressed size (including the varint prefix).
  • The checksum is computed on the uncompressed payload, so integrity is verified after decompression.
  • If the compressed output is not smaller than the original, the encoder falls back to comp_type = 0x00 (None).
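
The framing and fallback rules above can be sketched as follows. This is an illustration under assumptions: `frame_payload` is a hypothetical helper, the compressor itself is left out (the caller passes already-compressed bytes), and the size comparison includes the varint prefix, which the text does not specify.

```rust
/// Appends `v` as an unsigned LEB128 varint.
fn write_varint(mut v: u64, out: &mut Vec<u8>) {
    while v >= 0x80 {
        out.push(((v & 0x7F) as u8) | 0x80);
        v >>= 7;
    }
    out.push(v as u8);
}

/// Builds the on-wire payload for a block (hypothetical helper).
/// Returns (comp_type, payload bytes); block_len is the payload length.
pub fn frame_payload(comp_type: u8, uncompressed: &[u8], compressed: &[u8]) -> (u8, Vec<u8>) {
    // Compressed layout: uncompressed_len(varint) | compressed_data
    let mut framed = Vec::with_capacity(compressed.len() + 10);
    write_varint(uncompressed.len() as u64, &mut framed);
    framed.extend_from_slice(compressed);

    // Assumption: "not smaller" is judged on the framed size, prefix included.
    if framed.len() < uncompressed.len() {
        (comp_type, framed)
    } else {
        // Fall back to comp_type = 0x00 (None) and store the original bytes.
        (0x00, uncompressed.to_vec())
    }
}
```

In both cases the block checksum would still be computed over `uncompressed`.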

2.6 Decode Paths

  • Zero-copy (decode_next() → SurpValue<'a>): Borrows from the input slice. Rejects compressed blocks with a descriptive error.
  • Owned (decode_next_owned() → Value): Works transparently with both compressed and uncompressed blocks. Decompresses into an internal buffer when needed.

3. Wire Types

Each value is prefixed with a tag byte:

  • Low 4 bits: wire type
  • High 4 bits: flags

ID    Wire Type     Payload
0x00  Null          None
0x01  Bool          1 byte (0x00=false, 0x01=true)
0x02  VarUInt       LEB128 unsigned integer
0x03  VarInt        ZigZag + LEB128 signed integer
0x04  Fixed64       8 bytes little-endian (f64)
0x05  LenDelimited  sub-type(1) + len(varint) + data
0x06  StartObject   count(varint) + fields...
0x07  EndObject     None
0x08  StartArray    count(varint) + items...
0x09  EndArray      None
0x0A  Reference     ref_id(varint)
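
Splitting a tag byte into its two nibbles is a mask and a shift. The sketch below mirrors the wire type IDs listed above; the `WireType` enum and `split_tag` function are hypothetical names chosen for illustration, and the meaning of individual flag bits is not defined here.

```rust
/// Wire type IDs from the table above (illustrative enum, not a surp-core type).
#[derive(Debug, PartialEq, Clone, Copy)]
pub enum WireType {
    Null,
    Bool,
    VarUInt,
    VarInt,
    Fixed64,
    LenDelimited,
    StartObject,
    EndObject,
    StartArray,
    EndArray,
    Reference,
}

/// Splits a tag byte into (wire type, flags).
/// Returns None when the low nibble is not a known wire type ID.
pub fn split_tag(tag: u8) -> Option<(WireType, u8)> {
    let flags = tag >> 4; // high 4 bits
    let wire = match tag & 0x0F {
        // low 4 bits
        0x00 => WireType::Null,
        0x01 => WireType::Bool,
        0x02 => WireType::VarUInt,
        0x03 => WireType::VarInt,
        0x04 => WireType::Fixed64,
        0x05 => WireType::LenDelimited,
        0x06 => WireType::StartObject,
        0x07 => WireType::EndObject,
        0x08 => WireType::StartArray,
        0x09 => WireType::EndArray,
        0x0A => WireType::Reference,
        _ => return None,
    };
    Some((wire, flags))
}
```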

3.1 LenDelimited Sub-types

ID    Sub-type  Description
0x00  String    UTF-8 encoded string
0x01  Bytes     Raw binary data
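
As a worked example of the LenDelimited layout (tag, sub-type, varint length, data), here is a minimal sketch that encodes a string. `encode_string` is a hypothetical helper, and the flag bits in the tag are assumed to be zero.

```rust
/// Appends `v` as an unsigned LEB128 varint.
fn write_varint(mut v: u64, out: &mut Vec<u8>) {
    while v >= 0x80 {
        out.push(((v & 0x7F) as u8) | 0x80);
        v >>= 7;
    }
    out.push(v as u8);
}

/// Encodes a UTF-8 string as a LenDelimited value (hypothetical helper).
pub fn encode_string(s: &str) -> Vec<u8> {
    // Tag 0x05 = LenDelimited with flags assumed to be 0; sub-type 0x00 = String.
    let mut out = vec![0x05, 0x00];
    write_varint(s.len() as u64, &mut out);
    out.extend_from_slice(s.as_bytes());
    out
}
```

Under these assumptions the string "hi" encodes to the five bytes 05 00 02 68 69.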

3.2 Object Encoding

Figure (Spec §3.2): Object encoding, a bracketed stream.

0x06 (StartObject) | count(varint) | field₀ | field₁ | ... | 0x07 (EndObject)

Each field:

key_len | key | value (wire-encoded)

Fields are emitted in insertion order; keys are UTF-8, never length-prefixed by type.

3.3 Array Encoding

Figure (Spec §3.3): Array encoding.

0x08 (StartArray) | count(varint) | value₀ | value₁ | ... | 0x09 (EndArray)

Each value is wire-encoded. Arrays may be homogeneous or mixed.

4. Integer Encoding

4.1 Unsigned: LEB128

Variable-length encoding: 7 bits of data per byte, MSB indicates continuation.

Value     Encoded bytes
0         00
127       7F
128       80 01
300       AC 02
16384     80 80 01
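
The table above can be reproduced with a short encoder and decoder. This is a sketch with hypothetical function names; the decoder also rejects padded (overlong) forms and values that do not fit in 64 bits, which is one possible way to satisfy the overlong-encoding rule in §11.

```rust
/// Encodes `value` as unsigned LEB128: 7 data bits per byte, MSB = continuation.
pub fn encode_leb128(mut value: u64) -> Vec<u8> {
    let mut out = Vec::new();
    while value >= 0x80 {
        out.push(((value & 0x7F) as u8) | 0x80);
        value >>= 7;
    }
    out.push(value as u8);
    out
}

/// Decodes an unsigned LEB128 varint; returns (value, bytes consumed).
pub fn decode_leb128(buf: &[u8]) -> Option<(u64, usize)> {
    let mut value = 0u64;
    for (i, &byte) in buf.iter().enumerate().take(10) {
        let bits = u64::from(byte & 0x7F);
        // The tenth byte may only carry the top bit of a u64.
        if i == 9 && bits > 1 {
            return None;
        }
        value |= bits << (7 * i);
        if byte & 0x80 == 0 {
            // A zero final byte after the first is a padded (overlong) form.
            if i > 0 && byte == 0 {
                return None;
            }
            return Some((value, i + 1));
        }
    }
    None // truncated input or more than 10 bytes
}
```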

4.2 Signed: ZigZag + LEB128

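ZigZag maps signed integers to unsigned ones so that values of small magnitude stay small: 0→0, -1→1, 1→2, -2→3, 2→4, ... For a 64-bit signed n, the formula is encode(n) = (n << 1) ^ (n >> 63), where >> is an arithmetic (sign-extending) shift. The result is then written as LEB128.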
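
A minimal sketch of the ZigZag mapping and its inverse for 64-bit values follows. The function names are hypothetical; in Rust, `>>` on `i64` is already an arithmetic shift, so the formula carries over directly.

```rust
/// Maps a signed 64-bit integer to an unsigned one (ZigZag).
pub fn zigzag_encode(n: i64) -> u64 {
    // (n >> 63) is all ones for negative n and all zeros otherwise.
    ((n << 1) ^ (n >> 63)) as u64
}

/// Inverse of `zigzag_encode`.
pub fn zigzag_decode(z: u64) -> i64 {
    // The low bit carries the sign; the remaining bits carry the magnitude.
    ((z >> 1) as i64) ^ -((z & 1) as i64)
}
```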

5. Checksums

5.1 Per-Block: XXH64

Every block includes an 8-byte XXH64 hash of its uncompressed payload (seed=0, little-endian).

Why XXH64: ~30 GB/s throughput, excellent collision resistance for non-cryptographic integrity checks. Adds < 0.1% overhead to typical workloads.

5.1.1 Alternative checksum algorithms (feature flags)

Algorithm  Feature flag  Performance  Notes
XXH64      (default)     ~30 GB/s     Default, best-tested
XXH3-64    xxh3          ~50 GB/s     Newer, SIMD-optimized
CRC32      compat-crc32  ~10 GB/s     Legacy compatibility

The ChecksumAlgo enum provides a unified API for switching at runtime. The wire format always stores an 8-byte checksum field; CRC32 values are zero-extended to 8 bytes for backward compatibility.
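
To illustrate the zero-extension, the sketch below widens a CRC32 value into the 8-byte checksum field. It assumes the widened value is stored little-endian like the other checksums; the function names are hypothetical and are not part of ChecksumAlgo.

```rust
/// Widens a CRC32 value into the 8-byte checksum field.
/// Assumption: zero-extend to u64, then store little-endian.
pub fn crc32_to_checksum_field(crc: u32) -> [u8; 8] {
    u64::from(crc).to_le_bytes()
}

/// Reads a CRC32 back out of the field; None if the upper 4 bytes are not zero.
pub fn checksum_field_to_crc32(field: [u8; 8]) -> Option<u32> {
    let wide = u64::from_le_bytes(field);
    if wide > u64::from(u32::MAX) {
        None
    } else {
        Some(wide as u32)
    }
}
```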

5.2 File Trailer

The trailer block contains an XXH64 hash of all preceding bytes. This detects file-level truncation or corruption.

6. Endianness

Canonical wire format: little-endian.

All multi-byte fixed-width values (f64 payloads, checksums) are stored in little-endian. LEB128 varints are byte-order independent by definition. On big-endian hosts, the necessary byte swaps are performed by Rust's to_le_bytes()/from_le_bytes().
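
For example, a Fixed64 payload can be produced and read back with the standard library alone. The two wrapper functions are hypothetical names; `f64::to_le_bytes` and `f64::from_le_bytes` give the same bytes on little-endian and big-endian hosts.

```rust
/// Encodes an f64 as the 8 little-endian bytes of a Fixed64 payload.
pub fn encode_fixed64(x: f64) -> [u8; 8] {
    x.to_le_bytes()
}

/// Decodes a Fixed64 payload back into an f64.
pub fn decode_fixed64(bytes: [u8; 8]) -> f64 {
    f64::from_le_bytes(bytes)
}
```

The value 1.0 has the IEEE 754 bit pattern 0x3FF0000000000000, so its payload is 00 00 00 00 00 00 F0 3F.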

7. Schema Evolution

7.1 Field IDs

When using #[derive(Surp)], each field gets a stable integer ID via #[surp(id = N)]. Fields are matched by name in schema-less mode and by ID in schema-on-write mode.

7.2 Compatible Changes (minor version)

  • Adding new optional fields (with new IDs)
  • Adding new wire types with defined skip semantics

7.3 Incompatible Changes (major version)

  • Changing block framing
  • Changing existing wire type semantics
  • Removing the ability to skip unknown fields

7.4 Unknown Field Skipping

Decoders MUST be able to skip the value of any unknown field. The skip rule depends on the value's wire type:

  • Null/Bool/End*: fixed size, trivially skipped.
  • VarUInt/VarInt: skip varint bytes.
  • Fixed64: skip 8 bytes.
  • LenDelimited: skip the sub-type byte, read the length, skip that many bytes.
  • StartObject/StartArray: read count, recursively skip children.
  • Reference: skip varint ref_id.
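
The skip rules above can be sketched as one recursive function. This is an illustration, not the library's skip_value_at(): the names are hypothetical, the varint reader is simplified, and the sketch assumes the layouts from §3 (a LenDelimited payload starts with a sub-type byte, object fields are key_len | key | value, and containers close with their End tag).

```rust
pub const MAX_DEPTH: usize = 128; // default nesting limit from §11

/// Reads a LEB128 varint at `pos`; returns (value, position after it).
fn read_varint(buf: &[u8], pos: usize) -> Option<(u64, usize)> {
    let mut value = 0u64;
    for (i, &byte) in buf.get(pos..)?.iter().enumerate().take(10) {
        value |= u64::from(byte & 0x7F) << (7 * i);
        if byte & 0x80 == 0 {
            return Some((value, pos + i + 1));
        }
    }
    None
}

/// Advances `pos` by `len` bytes, failing if that would run past the buffer.
fn advance(buf: &[u8], pos: usize, len: u64) -> Option<usize> {
    let remaining = buf.len().checked_sub(pos)? as u64;
    if len > remaining { None } else { Some(pos + len as usize) }
}

/// Skips one wire-encoded value starting at `pos`; returns the position after it.
pub fn skip_value(buf: &[u8], pos: usize, depth: usize) -> Option<usize> {
    if depth > MAX_DEPTH {
        return None;
    }
    let wire = *buf.get(pos)? & 0x0F; // low nibble of the tag byte
    let pos = pos + 1;
    match wire {
        0x00 | 0x07 | 0x09 => Some(pos),     // Null, EndObject, EndArray
        0x01 => advance(buf, pos, 1),        // Bool
        0x02 | 0x03 | 0x0A => read_varint(buf, pos).map(|(_, next)| next),
        0x04 => advance(buf, pos, 8),        // Fixed64
        0x05 => {
            let pos = advance(buf, pos, 1)?; // sub-type byte
            let (len, pos) = read_varint(buf, pos)?;
            advance(buf, pos, len)
        }
        0x06 | 0x08 => {
            let is_object = wire == 0x06;
            let (count, mut pos) = read_varint(buf, pos)?;
            for _ in 0..count {
                if is_object {
                    // Assumed field layout: key_len(varint) | key | value
                    let (key_len, after_len) = read_varint(buf, pos)?;
                    pos = advance(buf, after_len, key_len)?;
                }
                pos = skip_value(buf, pos, depth + 1)?;
            }
            // Expect the matching End tag (0x07 for objects, 0x09 for arrays).
            if *buf.get(pos)? & 0x0F == wire + 1 { Some(pos + 1) } else { None }
        }
        _ => None,
    }
}
```

Every loop iteration either consumes at least one byte or fails, so a hostile count cannot make the skip run longer than the buffer.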

8. String Dictionary (Per-Block)

Within a data block, repeated strings can be stored in a dictionary table. Subsequent occurrences reference the dictionary by index using the Reference wire type.

Algorithm (Encoder):

  1. When enable_dedup() is called, the encoder maintains a per-block HashMap<String, u32>.
  2. First occurrence: encode string normally (LenDelimited + string data), record in the map with the next sequential index.
  3. Subsequent occurrences: encode as Reference wire type (0x0A) with the dictionary index.
  4. On flush_block(), if the dictionary is non-empty, emit a StringDict block (type 0x04) before the data block.
  5. The dictionary is cleared between blocks.
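
The encoder steps above can be sketched with a small state machine. `BlockDedup` and `Emitted` are hypothetical names used only for this illustration; the real encoder writes bytes, whereas this sketch only decides between a literal and a reference and hands back the dictionary entries at flush time.

```rust
use std::collections::HashMap;

/// What the encoder would write for one string occurrence.
#[derive(Debug, PartialEq)]
pub enum Emitted {
    Literal(String), // first occurrence: LenDelimited + string data
    Reference(u32),  // later occurrence: Reference (0x0A) + dictionary index
}

/// Per-block dedup state (illustrative, not the surp-core type).
#[derive(Default)]
pub struct BlockDedup {
    indices: HashMap<String, u32>,
}

impl BlockDedup {
    /// Steps 2 and 3: literal on first sight, reference afterwards.
    pub fn emit(&mut self, s: &str) -> Emitted {
        if let Some(&idx) = self.indices.get(s) {
            return Emitted::Reference(idx);
        }
        let idx = self.indices.len() as u32; // next sequential index
        self.indices.insert(s.to_owned(), idx);
        Emitted::Literal(s.to_owned())
    }

    /// Steps 4 and 5: return the entries in insertion order and clear the map.
    /// A non-empty result is what a StringDict block would be built from.
    pub fn flush_block(&mut self) -> Vec<String> {
        let mut entries: Vec<(u32, String)> =
            self.indices.drain().map(|(s, i)| (i, s)).collect();
        entries.sort();
        entries.into_iter().map(|(_, s)| s).collect()
    }
}
```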

8.1 StringDict Block Format

The StringDict block (type 0x04) is emitted immediately before its corresponding data block. It uses standard block framing:

block_type(0x04) | block_len(varint) | comp_type(0x00) | checksum(8B) | payload

Payload layout:

entry_count(varint) | entry₀ | entry₁ | ... | entryₙ₋₁

Each entry uses prefix-delta compression (entries are sorted lexicographically):

original_index(varint) | prefix_len(varint) | suffix_len(varint) | suffix_bytes
  • original_index: The insertion-order index matching Reference wire type IDs.
  • prefix_len: Number of bytes shared with the previous entry (0 for the first).
  • suffix_len: Length of the non-shared suffix.
  • suffix_bytes: The raw suffix bytes.

8.2 Decoder Handling

When the decoder encounters a StringDict block:

  1. Verify the block checksum.
  2. Parse the prefix-delta entries, reconstructing full strings.
  3. Populate the per-block string table in insertion order using original_index.
  4. Proceed to the next block (which should be a Data block).
  5. During data block decoding, Reference wire types resolve from the pre-populated table.

The StringDict block is consumed transparently — callers of decode_next() / decode_next_owned() never see it.

8.3 Prefix-Delta Compression

Entries in the StringDict block are sorted lexicographically before encoding. Each entry stores only the suffix that differs from the previous entry:

Figure (Spec §8.3): StringDict prefix-delta. Rows are listed in sorted order; idx is the row's position in that order.

idx  prefix_len  suffix             reconstructed
0    0           config_cache_host  config_cache_host
1    13          port               config_cache_port
2    7           database_host      config_database_host
3    16          port               config_database_port

This reduces dictionary overhead for datasets with structured/hierarchical key names.
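
The following sketch shows prefix-delta encoding and the matching reconstruction on the decoder side. It works on in-memory entries rather than serialized bytes, and `DeltaEntry`, `delta_encode`, and `delta_decode` are hypothetical names; sorting is assumed to be bytewise.

```rust
/// One dictionary entry before serialization (illustrative type).
#[derive(Debug, PartialEq)]
pub struct DeltaEntry {
    pub original_index: u32, // insertion-order index used by Reference values
    pub prefix_len: usize,   // bytes shared with the previous entry
    pub suffix: Vec<u8>,     // the non-shared remainder
}

/// Sorts strings bytewise and stores each as a delta against its predecessor.
/// `strings` is given in insertion order.
pub fn delta_encode(strings: &[&str]) -> Vec<DeltaEntry> {
    let mut sorted: Vec<(u32, &[u8])> = strings
        .iter()
        .enumerate()
        .map(|(i, s)| (i as u32, s.as_bytes()))
        .collect();
    sorted.sort_by(|a, b| a.1.cmp(b.1));

    let mut prev: &[u8] = &[];
    let mut out = Vec::with_capacity(sorted.len());
    for (original_index, bytes) in sorted {
        // Length of the common prefix with the previous entry.
        let prefix_len = prev.iter().zip(bytes).take_while(|(a, b)| a == b).count();
        out.push(DeltaEntry { original_index, prefix_len, suffix: bytes[prefix_len..].to_vec() });
        prev = bytes;
    }
    out
}

/// Rebuilds the string table in insertion order.
/// Returns None on a bad prefix length, invalid UTF-8, or a missing index.
pub fn delta_decode(entries: &[DeltaEntry]) -> Option<Vec<String>> {
    let mut table: Vec<Option<String>> = vec![None; entries.len()];
    let mut prev: Vec<u8> = Vec::new();
    for entry in entries {
        if entry.prefix_len > prev.len() {
            return None;
        }
        prev.truncate(entry.prefix_len);
        prev.extend_from_slice(&entry.suffix);
        let full = String::from_utf8(prev.clone()).ok()?;
        *table.get_mut(entry.original_index as usize)? = Some(full);
    }
    table.into_iter().collect()
}
```

Fed the four keys from the figure in any insertion order, `delta_encode` yields the prefix lengths 0, 13, 7, 16, and `delta_decode` restores the original order through `original_index`.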

9. Reference/Dedup

The Reference wire type (0x0A) currently encodes a varint index into the per-block string dictionary described in section 8. The implemented encoder uses it only for repeated UTF-8 strings when Encoder::enable_dedup() is set.

Reference IDs are scoped to a single block. Cross-block references are not supported (blocks are self-contained).

Structural subtree deduplication is not implemented in the v1 encoder or decoder. Future versions may add a separate reference table for that behavior, but files written by the current implementation must treat references as string-dictionary references.

10. Streaming vs Block Mode

Streaming Mode

  • Blocks are emitted as data arrives.
  • No index block.
  • Reader processes blocks sequentially.

Block Mode

  • Entire document encoded into one or more data blocks.
  • Optional index block at end for random access.
  • Optional schema block for self-describing files.

11. Security Considerations

See SECURITY.md for the full threat model.

Key points:

  • All lengths are bounds-checked against configurable limits.
  • Nesting depth is limited (default: 128).
  • Varint decoder rejects overlong encodings.
  • UTF-8 validation on all strings.
  • Per-block checksums prevent processing corrupted data.
  • Decompression output is bounded.
  • skip_value_at() enforces Limits (nesting + item count) during skip.

12. Feature Flags

All optimizations are gated behind Cargo feature flags to keep the default binary small and compilation fast.

12.1 surp-core features

Flag          Dependencies  Description
xxh3          -             Use XXH3-64 for checksums (faster SIMD path)
compat-crc32  crc32fast     CRC32 checksum support for legacy interop
fast-alloc    bumpalo       Per-block bump allocator via BumpDecoder
lz4           lz4_flex      LZ4 block compression support
zstd          zstd          Zstandard block compression support
snappy        snap          Snappy block compression support

12.2 surp-io features

Flag  Dependencies  Description
mmap  memmap2       Memory-mapped zero-copy file reader (MmapReader)

12.3 surp-compression features

Flag    Dependencies  Description
lz4     lz4_flex      LZ4 compression support
zstd    zstd          Zstandard compression
snappy  snap          Snappy compression

12.4 surp-simd features

Flag         Dependencies  Description
simd-varint  -             NEON SIMD-accelerated varint boundary pre-scan

13. Text Format

The Surp text format is a deterministic, human-readable notation that maps 1:1 to the binary format. See docs/TEXT_FORMAT.abnf for the normative ABNF grammar (RFC 5234).

Key differences from JSON:

  • Object fields terminated by ; not ,
  • Binary literals are written b64#<base64>; inside an object, the field is still terminated by the usual ;
  • Optional type annotations: 42::u32
  • Comments: // line and /* block */
  • Signed integers use explicit +/- prefix

14. CLI Reference

The surp CLI tool (surp-cli) provides the following commands:

Command    Description
inspect    Show block layout and checksums
pretty     Pretty-print in text notation
to-json    Convert to JSON
from-json  Convert JSON to binary
encode     Parse text notation, emit binary
decode     Decode binary to text notation
validate   Verify checksums and decode integrity
bench      Quick encode/decode performance test