v1 Binary Format
The stable compatibility surface. Blocks, types, varints, checksums.
1. Overview
Surp is a compact, canonical binary serialization format designed as an alternative to JSON. It provides deterministic encoding, schema evolution support, and both human-readable and binary representations.
Design principles:
- Safety and correctness over micro-optimizations
- Deterministic encoding (same data → same bytes, always)
- Zero-copy decoding when possible
- Forward/backward compatible schema evolution
- Streaming and random-access support
2. File Layout
File layout
2.1 Block Header
Block header anatomy
| Field | Size | Description |
|---|---|---|
| block_type | 1 byte | Block type ID (see §2.3) |
| block_len | varint | Length of payload in bytes |
| comp_type | 1 byte | Compression algorithm ID |
| checksum | 8 bytes | XXH64 of uncompressed payload (LE) |
| payload | N bytes | Block data (N = block_len) |
2.2 Block Types
| ID | Name | Description |
|---|---|---|
| 0x01 | Data | Encoded value data |
| 0x02 | Index | Offset index for random access |
| 0x03 | Schema | Embedded schema information |
| 0x04 | StringDict | String dictionary for deduplication |
| 0xFF | Trailer | File-level checksum (last block) |
2.4 Compression Types
| ID | Name | Description |
|---|---|---|
| 0x00 | None | No compression |
| 0x01 | Zstd | Zstandard compression |
| 0x02 | Snappy | Snappy compression |
| 0x03 | LZ4 | LZ4 frame compression |
Adaptive compression (feature
lz4/zstd/snappy): the encoder can sample the first N bytes of a block and select the algorithm that achieves the best ratio above a configurable threshold. SeeAdaptiveSelectorin surp-compression.
2.5 Compressed Block Wire Format
When a block is compressed, the on-wire payload is:
1
uncompressed_len(varint) | compressed_data
- The
block_lenfield in the block header reflects the compressed size (including the varint prefix). - The
checksumis computed on the uncompressed payload, so integrity is verified after decompression. - If the compressed output is not smaller than the original, the encoder falls back to
comp_type = 0x00(None).
2.6 Decode Paths
- Zero-copy (
decode_next()→SurpValue<'a>): Borrows from the input slice. Rejects compressed blocks with a descriptive error. - Owned (
decode_next_owned()→Value): Works transparently with both compressed and uncompressed blocks. Decompresses into an internal buffer when needed.
3. Wire Types
Each value is prefixed with a tag byte:
- Low 4 bits: wire type
- High 4 bits: flags
| ID | Wire Type | Payload |
|---|---|---|
| 0x00 | Null | None |
| 0x01 | Bool | 1 byte (0x00=false, 0x01=true) |
| 0x02 | VarUInt | LEB128 unsigned integer |
| 0x03 | VarInt | ZigZag + LEB128 signed integer |
| 0x04 | Fixed64 | 8 bytes little-endian (f64) |
| 0x05 | LenDelimited | sub-type(1) + len(varint) + data |
| 0x06 | StartObject | count(varint) + fields... |
| 0x07 | EndObject | None |
| 0x08 | StartArray | count(varint) + items... |
| 0x09 | EndArray | None |
| 0x0A | Reference | ref_id(varint) |
3.1 LenDelimited Sub-types
| ID | Sub-type | Description |
|---|---|---|
| 0x00 | String | UTF-8 encoded string |
| 0x01 | Bytes | Raw binary data |
3.2 Object Encoding
Object encoding
3.3 Array Encoding
Array encoding
4. Integer Encoding
4.1 Unsigned: LEB128
Variable-length encoding: 7 bits of data per byte, MSB indicates continuation.
1 2 3 4 5 6
Value Encoded bytes
0 00
127 7F
128 80 01
300 AC 02
16384 80 80 01
4.2 Signed: ZigZag + LEB128
ZigZag maps signed to unsigned: 0→0, -1→1, 1→2, -2→3, 2→4, ...
Formula: encode(n) = (n << 1) ^ (n >> 63)
5. Checksums
5.1 Per-Block: XXH64
Every block includes an 8-byte XXH64 hash of its uncompressed payload (seed=0, little-endian).
Why XXH64: ~30 GB/s throughput, excellent collision resistance for non-cryptographic integrity checks. Adds < 0.1% overhead to typical workloads.
5.1.1 Alternative checksum algorithms (feature flags)
| Algorithm | Feature flag | Performance | Notes |
|---|---|---|---|
| XXH64 | (default) | ~30 GB/s | Default, best-tested |
| XXH3-64 | xxh3 |
~50 GB/s | Newer, SIMD-optimized |
| CRC32 | compat-crc32 |
~10 GB/s | Legacy compatibility |
The ChecksumAlgo enum provides a unified API for switching at runtime.
The wire format always stores an 8-byte checksum field; CRC32 values are
zero-extended to 8 bytes for backward compatibility.
5.2 File Trailer
The trailer block contains an XXH64 hash of all preceding bytes. This detects file-level truncation or corruption.
6. Endianness
Canonical wire format: little-endian.
All multi-byte fixed-width integers (f64, checksums) are stored in little-endian. LEB128 varints are byte-order independent by definition. On big-endian hosts, byte-swap operations are inserted by Rust's to_le_bytes()/from_le_bytes().
7. Schema Evolution
7.1 Field IDs
When using #[derive(Surp)], each field gets a stable integer ID via #[surp(id = N)]. Fields are matched by name in schema-less mode and by ID in schema-on-write mode.
7.2 Compatible Changes (minor version)
- Adding new optional fields (with new IDs)
- Adding new wire types with defined skip semantics
7.3 Incompatible Changes (major version)
- Changing block framing
- Changing existing wire type semantics
- Removing the ability to skip unknown fields
7.4 Unknown Field Skipping
Decoders MUST be able to skip unknown wire types:
- Null/Bool/End*: fixed size, trivially skipped.
- VarUInt/VarInt: skip varint bytes.
- Fixed64: skip 8 bytes.
- LenDelimited: read length, skip that many bytes.
- StartObject/StartArray: read count, recursively skip children.
- Reference: skip varint ref_id.
8. String Dictionary (Per-Block)
Within a data block, repeated strings can be stored in a dictionary table. Subsequent occurrences reference the dictionary by index using the Reference wire type.
Algorithm (Encoder):
- When
enable_dedup()is called, the encoder maintains a per-blockHashMap<String, u32>. - First occurrence: encode string normally (LenDelimited + string data), record in the map with the next sequential index.
- Subsequent occurrences: encode as Reference wire type (0x0A) with the dictionary index.
- On
flush_block(), if the dictionary is non-empty, emit a StringDict block (type 0x04) before the data block. - The dictionary is cleared between blocks.
8.1 StringDict Block Format
The StringDict block (type 0x04) is emitted immediately before its corresponding data block. It uses standard block framing:
1
block_type(0x04) | block_len(varint) | comp_type(0x00) | checksum(8B) | payload
Payload layout:
1
entry_count(varint) | entry₀ | entry₁ | ... | entryₙ₋₁
Each entry uses prefix-delta compression (entries are sorted lexicographically):
1
original_index(varint) | prefix_len(varint) | suffix_len(varint) | suffix_bytes
original_index: The insertion-order index matching Reference wire type IDs.prefix_len: Number of bytes shared with the previous entry (0 for the first).suffix_len: Length of the non-shared suffix.suffix_bytes: The raw suffix bytes.
8.2 Decoder Handling
When the decoder encounters a StringDict block:
- Verify the block checksum.
- Parse the prefix-delta entries, reconstructing full strings.
- Populate the per-block string table in insertion order using
original_index. - Proceed to the next block (which should be a Data block).
- During data block decoding, Reference wire types resolve from the pre-populated table.
The StringDict block is consumed transparently — callers of decode_next() / decode_next_owned() never see it.
8.3 Prefix-Delta Compression
Entries in the StringDict block are sorted lexicographically before encoding. Each entry stores only the suffix that differs from the previous entry:
StringDict prefix-delta
This reduces dictionary overhead for datasets with structured/hierarchical key names.
9. Reference/Dedup
The Reference wire type (0x0A) currently encodes a varint index into the
per-block string dictionary described in section 8. The implemented encoder
uses it only for repeated UTF-8 strings when Encoder::enable_dedup() is set.
Reference IDs are scoped to a single block. Cross-block references are not supported (blocks are self-contained).
Structural subtree deduplication is not implemented in the v1 encoder or decoder. Future versions may add a separate reference table for that behavior, but files written by the current implementation must treat references as string-dictionary references.
10. Streaming vs Block Mode
Streaming Mode
- Blocks are emitted as data arrives.
- No index block.
- Reader processes blocks sequentially.
Block Mode
- Entire document encoded into one or more data blocks.
- Optional index block at end for random access.
- Optional schema block for self-describing files.
11. Security Considerations
See SECURITY.md for the full threat model.
Key points:
- All lengths are bounds-checked against configurable limits.
- Nesting depth is limited (default: 128).
- Varint decoder rejects overlong encodings.
- UTF-8 validation on all strings.
- Per-block checksums prevent processing corrupted data.
- Decompression output is bounded.
skip_value_at()enforces Limits (nesting + item count) during skip.
12. Feature Flags
All optimizations are gated behind Cargo feature flags to keep the default binary small and compilation fast.
12.1 surp-core features
| Flag | Dependencies | Description |
|---|---|---|
xxh3 |
— | Use XXH3-64 for checksums (faster SIMD path) |
compat-crc32 |
crc32fast |
CRC32 checksum support for legacy interop |
fast-alloc |
bumpalo |
Per-block bump allocator via BumpDecoder |
lz4 |
lz4_flex |
LZ4 block compression support |
zstd |
zstd |
Zstandard block compression support |
snappy |
snap |
Snappy block compression support |
12.2 surp-io features
| Flag | Dependencies | Description |
|---|---|---|
mmap |
memmap2 |
Memory-mapped zero-copy file reader (MmapReader) |
12.3 surp-compression features
| Flag | Dependencies | Description |
|---|---|---|
lz4 |
lz4_flex |
LZ4 compression support |
zstd |
zstd |
Zstandard compression |
snappy |
snap |
Snappy compression |
12.4 surp-simd features
| Flag | Dependencies | Description |
|---|---|---|
simd-varint |
— | NEON SIMD-accelerated varint boundary pre-scan |
13. Text Format
The Surp text format is a deterministic, human-readable notation that maps
1:1 to the binary format. See docs/TEXT_FORMAT.abnf for the normative ABNF
grammar (RFC 5234).
Key differences from JSON:
- Object fields terminated by
;not, - Binary literals:
b64#<base64>; object fields still use the normal field; - Optional type annotations:
42::u32 - Comments:
// lineand/* block */ - Signed integers use explicit
+/-prefix
14. CLI Reference
The surp CLI tool (surp-cli) provides the following commands:
| Command | Description |
|---|---|
inspect |
Show block layout and checksums |
pretty |
Pretty-print in text notation |
to-json |
Convert to JSON |
from-json |
Convert JSON to binary |
encode |
Parse text notation, emit binary |
decode |
Decode binary to text notation |
validate |
Verify checksums and decode integrity |
bench |
Quick encode/decode performance test |