Architecture

A serializer drawn in four neat lanes.

Surp's repository is a Cargo workspace of nine crates, a Python package, and a small set of fixtures and benches. Underneath the surface area, one codec carries the load — and one wire format is the only stable contract.

Bird's-eye view

The workspace at a glance

The codec lives in surp-core. Public surfaces (surp-cli, surp-python, surp-ffi, surp-derive) all depend on it, never the other way around. IO and storage adapters (surp-io, surp-compression, surp-simd) sit beside the codec and are pulled in via Cargo features. The RFC-001 work lives entirely inside surp_core::rfc001, with its own namespace and its own file extension, and is never mixed with v1 .surp bytes.

[Diagram: the workspace in four lanes.
  Surfaces: surp-cli (binary tool), surp-python (PyO3 module), surp-ffi (C ABI helpers), surp-derive (Surp / SurpSchema).
  Codec: surp-core (Encoder · Decoder · Value · text; the heart of the workspace), containing rfc001 (CTN · CBF · CQL), limits (depth · size · count), checksum (XXH64 · XXH3), varint (LEB128).
  IO & transport: surp-io (tokio framed · mmap), surp-compression (zstd · lz4 · snappy), surp-simd (varint pre-scan), bench (criterion).
  Storage: .surp file (block-framed v1), .crb file (RFC-001 CBF), .ctn fixture (text notation).
  Notes: public surfaces depend on the codec, never the other way around; all writers emit block-framed v1 by default.]
Four lanes (surfaces, codec, IO & transport, storage) mirror the workspace crate layout. Arrows show dependency direction; the codec never imports from a surface.

Encode pipeline

From a value tree to a checked file

An encode never skips a stage. The encoder walks a Value tree and emits scalars with type tags and varint-encoded lengths. When string deduplication is on, repeated strings are interned into a dedup table that sits inside the same block. The block writer then prefixes the payload with a type byte, the payload length, a compression-type byte, and an XXH64 checksum of the uncompressed payload. A trailer block carries the overall checksum; readers verify both before exposing any value.
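
The stages described above can be sketched in a few dependency-free functions. This is a minimal illustration, not surp-core's code: the names `write_varint`, `DedupTable`, and `write_block` are hypothetical, the field widths (a varint length and an 8-byte little-endian checksum) are assumptions, and FNV-1a stands in for XXH64 so the sketch needs no external crate. The trailer block is left out.

```rust
use std::collections::HashMap;

/// LEB128 unsigned varint: 7 payload bits per byte, high bit set on every
/// byte except the last.
fn write_varint(mut n: u64, out: &mut Vec<u8>) {
    loop {
        let byte = (n & 0x7f) as u8;
        n >>= 7;
        if n == 0 {
            out.push(byte);
            return;
        }
        out.push(byte | 0x80);
    }
}

/// Stand-in 64-bit checksum (FNV-1a). The format described here uses XXH64;
/// FNV-1a only keeps this sketch self-contained.
fn checksum64(data: &[u8]) -> u64 {
    let mut h: u64 = 0xcbf2_9ce4_8422_2325;
    for &b in data {
        h ^= b as u64;
        h = h.wrapping_mul(0x0000_0100_0000_01b3);
    }
    h
}

/// Hypothetical string interner: a repeated string maps to the ID it was
/// given on first sight, so the payload can carry the ID instead.
#[derive(Default)]
struct DedupTable {
    ids: HashMap<String, u64>,
    strings: Vec<String>,
}

impl DedupTable {
    fn intern(&mut self, s: &str) -> u64 {
        if let Some(&id) = self.ids.get(s) {
            return id;
        }
        let id = self.strings.len() as u64;
        self.ids.insert(s.to_owned(), id);
        self.strings.push(s.to_owned());
        id
    }

    /// Interned strings in first-seen order, ready to be written into the block.
    fn entries(&self) -> &[String] {
        &self.strings
    }
}

/// Frames a payload in the order the text gives: type byte, payload length,
/// compression-type byte, checksum of the uncompressed payload, then the payload.
fn write_block(block_type: u8, compression: u8, payload: &[u8]) -> Vec<u8> {
    let mut out = Vec::with_capacity(payload.len() + 20);
    out.push(block_type);
    write_varint(payload.len() as u64, &mut out);
    out.push(compression);
    out.extend_from_slice(&checksum64(payload).to_le_bytes());
    out.extend_from_slice(payload);
    out
}
```

Under these assumptions, `write_block(1, 0, b"hi")` yields 13 bytes: the type byte, a one-byte length, the compression byte, eight checksum bytes, and the two payload bytes. Interning the same string twice returns the same ID, which is what lets a block store it once.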

[Diagram: the encode path in five stages. Value tree (Value / SurpValue) → Encoder (varint · dedup) → Block writer (type · len · comp) → Checksum (XXH64 per block) → Trailer (overall checksum). Stage labels: value tree, scalar & container ops, framed payload, integrity, file end.]
The encode path. Decode is the same diagram in reverse: checksums are verified before any payload is exposed to caller code.

Trust boundary

Where untrusted bytes become a value

The decoder is the only piece of code allowed to look at raw bytes. Limits (max depth, max element counts, max payload sizes) are enforced before allocation; checksum verification fails closed; corrupt or oversized inputs never reach a constructed Value. The Rust API exposes two value flavors: Value for owned trees, and SurpValue<'a> for borrowed zero-copy decode of uncompressed v1 data.
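
A minimal sketch of the fail-closed order of checks, assuming a block header of type byte, varint length, compression byte, and an 8-byte little-endian checksum. `Limits`, `DecodeError`, and `read_block` are hypothetical names, only the payload-size limit is modeled, and FNV-1a again stands in for XXH64. The point is the ordering: the declared length is compared against the limit before it is used for anything, and the payload slice is returned only after the checksum matches.

```rust
#[derive(Debug, PartialEq)]
enum DecodeError {
    Truncated,
    VarintOverflow,
    PayloadTooLarge { len: u64, max: u64 },
    UnsupportedCompression(u8),
    ChecksumMismatch,
}

/// Hypothetical limit set; the real decoder also bounds depth and element counts.
struct Limits {
    max_payload_len: u64,
}

/// Stand-in for XXH64 (FNV-1a), used only to keep the sketch dependency-free.
fn checksum64(data: &[u8]) -> u64 {
    let mut h: u64 = 0xcbf2_9ce4_8422_2325;
    for &b in data {
        h ^= b as u64;
        h = h.wrapping_mul(0x0000_0100_0000_01b3);
    }
    h
}

/// Reads a LEB128 varint, rejecting encodings that do not fit in 64 bits.
fn read_varint(buf: &[u8], pos: &mut usize) -> Result<u64, DecodeError> {
    let mut value = 0u64;
    let mut shift = 0u32;
    loop {
        let byte = *buf.get(*pos).ok_or(DecodeError::Truncated)?;
        *pos += 1;
        let bits = (byte & 0x7f) as u64;
        if shift > 63 || (shift == 63 && bits > 1) {
            return Err(DecodeError::VarintOverflow);
        }
        value |= bits << shift;
        if byte & 0x80 == 0 {
            return Ok(value);
        }
        shift += 7;
    }
}

/// Returns the block type and a borrowed payload, or an error. No payload
/// byte is handed out before the limit and checksum checks have passed.
fn read_block<'a>(buf: &'a [u8], limits: &Limits) -> Result<(u8, &'a [u8]), DecodeError> {
    let mut pos = 0usize;
    let block_type = *buf.get(pos).ok_or(DecodeError::Truncated)?;
    pos += 1;

    // Limit check comes first: an attacker-chosen length never drives allocation.
    let len = read_varint(buf, &mut pos)?;
    if len > limits.max_payload_len {
        return Err(DecodeError::PayloadTooLarge { len, max: limits.max_payload_len });
    }

    // This sketch handles only uncompressed blocks (assumed tag 0).
    let compression = *buf.get(pos).ok_or(DecodeError::Truncated)?;
    pos += 1;
    if compression != 0 {
        return Err(DecodeError::UnsupportedCompression(compression));
    }

    let sum = buf.get(pos..pos + 8).ok_or(DecodeError::Truncated)?;
    let mut sum_bytes = [0u8; 8];
    sum_bytes.copy_from_slice(sum);
    pos += 8;

    let end = pos.checked_add(len as usize).ok_or(DecodeError::Truncated)?;
    let payload = buf.get(pos..end).ok_or(DecodeError::Truncated)?;
    if checksum64(payload) != u64::from_le_bytes(sum_bytes) {
        return Err(DecodeError::ChecksumMismatch);
    }
    Ok((block_type, payload))
}
```

With `Limits { max_payload_len: 16 }`, a header that declares a 65535-byte payload is rejected as `PayloadTooLarge` right after the length is read, and a frame whose last payload byte has been flipped is rejected as `ChecksumMismatch`.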

[Diagram: untrusted input to validated value tree. bytes (.surp / .crb) → Decoder (checks limits) → checksum verification (XXH64, fail closed) → Value tree (owned) or SurpValue<'a> (borrowed) → caller code (never sees raw bytes).]
The decoder is the only bridge between untrusted bytes and caller code. Limits and checksums fail closed before any value is constructed.

Data ownership

One format, two flavors of decode

Owned — Value

Allocates and owns its children. Use when you want a long-lived tree, mutation, or to ship the value across thread boundaries. Always available, including for compressed payloads.

Borrowed — SurpValue<'a>

Zero-copy view tied to the original byte buffer. Available for uncompressed v1 data. Pay nothing on decode; pay only when you ask a field for an owned string or array.
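
The difference between the two flavors can be illustrated with simplified stand-ins. `OwnedValue` and `BorrowedValue<'a>` below are hypothetical types, not the definitions of Value and SurpValue<'a>, and the toy layout `[tag][len][payload]` is an assumption made only for this sketch. The borrowed view holds slices into the input buffer; copying happens only when an owned value is requested.

```rust
/// Simplified stand-in for an owned value: it allocates and can outlive the buffer.
#[derive(Debug, Clone, PartialEq)]
enum OwnedValue {
    Str(String),
    Bytes(Vec<u8>),
}

/// Simplified stand-in for a borrowed view: it points into the input buffer.
#[derive(Debug, Clone, Copy, PartialEq)]
enum BorrowedValue<'a> {
    Str(&'a str),
    Bytes(&'a [u8]),
}

impl<'a> BorrowedValue<'a> {
    /// The only place a copy happens: converting the view into an owned value.
    fn to_owned_value(&self) -> OwnedValue {
        match *self {
            BorrowedValue::Str(s) => OwnedValue::Str(s.to_owned()),
            BorrowedValue::Bytes(b) => OwnedValue::Bytes(b.to_vec()),
        }
    }
}

/// Toy layout assumed here: [tag][len][payload], tag 0 = bytes, tag 1 = UTF-8 string.
/// Returns a view without copying the payload.
fn view(buf: &[u8]) -> Option<BorrowedValue<'_>> {
    let (&tag, rest) = buf.split_first()?;
    let (&len, rest) = rest.split_first()?;
    let payload = rest.get(..len as usize)?;
    match tag {
        0 => Some(BorrowedValue::Bytes(payload)),
        1 => std::str::from_utf8(payload).ok().map(BorrowedValue::Str),
        _ => None,
    }
}

/// Owned decode: same validation, followed by one copy out of the buffer.
fn decode_owned(buf: &[u8]) -> Option<OwnedValue> {
    view(buf).map(|v| v.to_owned_value())
}
```

In this sketch the string returned by `view` starts at the same address as the third byte of the buffer, so the borrow checker keeps the buffer alive for as long as the view is used. The result of `decode_owned` has no such tie and can be moved to another thread after the buffer is dropped.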

Subsystem notes

What each crate is responsible for

  • surp-core: The codec: encoder, decoder, value tree, block framing, text notation, resource limits, and the RFC-001 modules.
  • surp-derive: #[derive(Surp)] and #[derive(SurpSchema)] for named Rust structs; stable numeric field IDs for forward-compatible schema evolution.
  • surp-cli: The surp binary tool. Verb-driven; converts JSON↔v1, encodes/decodes the text notation, inspects, validates, runs CLI benches and the RFC-001 commands.
  • surp-python: PyO3 extension that exports the Python package named surp; ships SurpValue views, Encoder/SurpDecoder, and the surp.model RFC-001 validation layer.
  • surp-io: Tokio framed IO, shared buffers via the bytes crate, optional mmap reader for memory-mapped decode.
  • surp-compression: Compression trait and optional zstd, lz4, and snappy adapters. All three are feature-gated; none are required.
  • surp-ffi: C ABI helpers with JSON-to-Surp and Surp-to-JSON buffer entry points for embedding in non-Rust hosts.
  • surp-simd: Scalar-safe scanning helpers and an optional aarch64 SIMD varint pre-scan path.
  • bench: Criterion-driven Rust and Python benchmark harnesses with deterministic datasets and committed result fixtures.
  • fuzz: cargo-fuzz targets and corpora for the decoder, the text parser, varints, block framing, and full roundtrips. Excluded from the workspace build by design.

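
As an illustration of the feature-gated adapter pattern mentioned for surp-compression, here is a hedged sketch. The trait shape, the `tag` values, and the names `NoCompression`, `Zstd`, and `available_codecs` are all assumptions; the actual Compression trait may look different. The gated part uses the `zstd` crate's `encode_all` and `decode_all` functions and compiles only when a Cargo feature named `zstd` is enabled (the feature name is also assumed).

```rust
use std::io;

/// Hypothetical codec trait; the real trait in surp-compression may differ.
trait Compression {
    /// Tag a block writer could store in the compression-type byte (assumed values).
    fn tag(&self) -> u8;
    fn compress(&self, input: &[u8]) -> io::Result<Vec<u8>>;
    fn decompress(&self, input: &[u8]) -> io::Result<Vec<u8>>;
}

/// Always available: passes bytes through unchanged.
struct NoCompression;

impl Compression for NoCompression {
    fn tag(&self) -> u8 {
        0
    }
    fn compress(&self, input: &[u8]) -> io::Result<Vec<u8>> {
        Ok(input.to_vec())
    }
    fn decompress(&self, input: &[u8]) -> io::Result<Vec<u8>> {
        Ok(input.to_vec())
    }
}

/// Compiled only when the (assumed) `zstd` Cargo feature is enabled.
#[cfg(feature = "zstd")]
struct Zstd {
    level: i32,
}

#[cfg(feature = "zstd")]
impl Compression for Zstd {
    fn tag(&self) -> u8 {
        1
    }
    fn compress(&self, input: &[u8]) -> io::Result<Vec<u8>> {
        zstd::encode_all(input, self.level)
    }
    fn decompress(&self, input: &[u8]) -> io::Result<Vec<u8>> {
        zstd::decode_all(input)
    }
}

/// Lists the codecs this build was compiled with. With no features enabled,
/// only the pass-through codec is present.
fn available_codecs() -> Vec<Box<dyn Compression>> {
    #[allow(unused_mut)]
    let mut codecs: Vec<Box<dyn Compression>> = vec![Box::new(NoCompression)];
    #[cfg(feature = "zstd")]
    codecs.push(Box::new(Zstd { level: 3 }));
    codecs
}
```

Because every adapter sits behind `#[cfg(feature = ...)]`, a build without compression features carries no compression dependency at all, which matches the statement that none of the adapters are required.
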
Design choices

Why this shape, and what it doesn't try to be

Three explicit tradeoffs steer the design. Safety over micro-optimization: checksums are verified before payloads are exposed, and resource limits sit between input and allocation. Determinism: the same input value produces the same bytes every time, a property that makes diffing, content-addressing, and replay tractable. Schema evolution as a first-class feature: the derive macros encode stable numeric field IDs, and unknown fields are skipped on decode, so old readers gracefully ignore new fields.
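
The field-ID mechanism can be sketched without the derive macros. Everything below is hypothetical: the record layout (one byte of field ID, one byte of length, then the value) is a simplification chosen for the example, and `UserV1`, `UserV2`, `put_field`, `encode_user_v2`, and `decode_user_v1` are illustrative names. The writer emits fields in a fixed order, so equal values give equal bytes, and the older reader skips any ID it does not know.

```rust
#[derive(Debug, Default, PartialEq)]
struct UserV1 {
    id: u32,      // field ID 1
    name: String, // field ID 2
}

struct UserV2 {
    id: u32,       // field ID 1
    name: String,  // field ID 2
    email: String, // field ID 3, added in a later schema version
}

/// Appends one field as [field_id][len][value]. Values are capped at 255
/// bytes only because this toy layout uses a one-byte length.
fn put_field(out: &mut Vec<u8>, field_id: u8, value: &[u8]) {
    assert!(value.len() <= 255, "toy layout uses a one-byte length");
    out.push(field_id);
    out.push(value.len() as u8);
    out.extend_from_slice(value);
}

/// Writes fields in ascending ID order, so the output is deterministic.
fn encode_user_v2(user: &UserV2) -> Vec<u8> {
    let mut out = Vec::new();
    put_field(&mut out, 1, &user.id.to_le_bytes());
    put_field(&mut out, 2, user.name.as_bytes());
    put_field(&mut out, 3, user.email.as_bytes());
    out
}

/// A V1 reader: decodes the IDs it knows and skips the rest by length.
fn decode_user_v1(mut buf: &[u8]) -> Option<UserV1> {
    let mut user = UserV1::default();
    while !buf.is_empty() {
        if buf.len() < 2 {
            return None; // truncated field header
        }
        let field_id = buf[0];
        let len = buf[1] as usize;
        let value = buf.get(2..2 + len)?;
        match field_id {
            1 => {
                if value.len() != 4 {
                    return None;
                }
                let mut raw = [0u8; 4];
                raw.copy_from_slice(value);
                user.id = u32::from_le_bytes(raw);
            }
            2 => user.name = std::str::from_utf8(value).ok()?.to_owned(),
            _ => {} // unknown field ID: skip it, the length says how far
        }
        buf = &buf[2 + len..];
    }
    Some(user)
}
```

Feeding the output of `encode_user_v2` to `decode_user_v1` returns the `id` and `name` and silently drops `email`, which is the "old readers ignore new fields" behavior described above. Encoding the same `UserV2` twice gives byte-identical output.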

Surp is not a streaming-only format and not an in-place editable format. It is a canonical container for value trees with optional random access via the index block. Anything that looks like a database, an RPC framework, or a schema registry is out of scope.