This page combines the project's SECURITY.md and DESIGN_RISKS.md documents so that security reporting and known design tradeoffs are reachable from a single URL.

Security Policy

Threat Model

Surp is designed to safely handle untrusted input. The decoder is built to resist adversarial documents, including the attack vectors listed in the table below.

Attack Vectors & Mitigations

Attack	Mitigation
Oversized length fields	All varint-decoded lengths are bounds-checked against configurable `Limits`. Default max block size: 64 MiB, max string: 16 MiB.
Integer overflow	LEB128 decoder rejects varints > 10 bytes. ZigZag decoding uses wrapping arithmetic (no UB).
Deep nesting	Configurable `max_nesting_depth` (default: 128, strict: 32). Exceeded depth returns `NestingTooDeep` error.
Memory exhaustion	Per-session `max_memory` limit (default: 256 MiB). Array/object pre-allocation capped at 1024 elements.
Item count bomb	`max_items` limit (default: 1M) prevents allocation of enormous arrays/objects from small input.
Recursive references	The reference wire type is currently decoded as an integer ID. Full resolution will validate IDs against a bounded reference table.
Malformed varints	Decoder rejects truncated varints (`UnexpectedEof`) and overlong encodings (`VarintOverflow`).
Invalid UTF-8	All string fields are validated with `std::str::from_utf8`. Invalid sequences produce `InvalidUtf8` error.
Checksum bypass	Per-block XXH64 checksums are verified before payload processing. Corrupted blocks are rejected.
Compression bombs	Decompression output is bounded by `max_block_size`. Snappy exposes the decompressed length before decompression, so oversized blocks can be rejected up front.
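
The varint rows of the table can be illustrated with a minimal sketch. It assumes unsigned LEB128 over `u64` and standard ZigZag mapping; the function names and the error enum are hypothetical stand-ins (only the `UnexpectedEof` and `VarintOverflow` names come from the table), so this is not the surp-core implementation.

```rust
/// Hypothetical error type; only the variant names are taken from the table.
#[derive(Debug, PartialEq, Eq)]
pub enum VarintError {
    UnexpectedEof,
    VarintOverflow,
}

/// A 64-bit value needs at most ceil(64 / 7) = 10 LEB128 bytes.
const MAX_VARINT_LEN: usize = 10;

/// Decodes an unsigned LEB128 varint from the start of `input`.
/// Returns the value and the number of bytes consumed.
pub fn decode_varint(input: &[u8]) -> Result<(u64, usize), VarintError> {
    let mut value: u64 = 0;
    for (i, &byte) in input.iter().enumerate() {
        if i >= MAX_VARINT_LEN {
            return Err(VarintError::VarintOverflow);
        }
        let payload = (byte & 0x7f) as u64;
        // The tenth byte may only carry the single remaining bit (bit 63).
        if i == MAX_VARINT_LEN - 1 && payload > 1 {
            return Err(VarintError::VarintOverflow);
        }
        value |= payload << (7 * i);
        if byte & 0x80 == 0 {
            return Ok((value, i + 1));
        }
    }
    // Input ended while the continuation bit was still set (or was empty).
    Err(VarintError::UnexpectedEof)
}

/// Maps a ZigZag-encoded unsigned value back to a signed one.
/// `wrapping_neg` keeps the arithmetic well-defined for every input.
pub fn zigzag_decode(n: u64) -> i64 {
    ((n >> 1) as i64) ^ ((n & 1) as i64).wrapping_neg()
}
```

For example, `decode_varint(&[0xac, 0x02])` yields `Ok((300, 2))`, and `zigzag_decode(1)` yields `-1`. This sketch accepts non-minimal encodings such as `[0x80, 0x00]`; whether Surp rejects those is not stated here.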

Resource Limits (Configurable)

rust

use surp_core::Limits;

// For untrusted network input:
let limits = Limits::strict();
// max_nesting_depth: 32
// max_block_size: 1 MiB
// max_items: 10,000
// max_memory: 4 MiB
// max_string_length: 64 KiB
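
To show how such limits might be applied during decoding, here is a self-contained sketch. The `Limits` struct below is a hypothetical stand-in for `surp_core::Limits`: the field names and preset values follow this page, while the helper functions, the `TooManyItems` error name, the inclusive depth comparison, and reading "1M" as 1,000,000 are assumptions.

```rust
/// Hypothetical stand-in for `surp_core::Limits`.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct Limits {
    pub max_nesting_depth: usize,
    pub max_block_size: usize,
    pub max_items: usize,
    pub max_memory: usize,
    pub max_string_length: usize,
}

const KIB: usize = 1024;
const MIB: usize = 1024 * KIB;

impl Default for Limits {
    /// Default values as listed in the attack table.
    fn default() -> Self {
        Limits {
            max_nesting_depth: 128,
            max_block_size: 64 * MIB,
            max_items: 1_000_000, // "1M" read as one million (assumption)
            max_memory: 256 * MIB,
            max_string_length: 16 * MIB,
        }
    }
}

impl Limits {
    /// Strict preset for untrusted network input.
    pub fn strict() -> Self {
        Limits {
            max_nesting_depth: 32,
            max_block_size: MIB,
            max_items: 10_000,
            max_memory: 4 * MIB,
            max_string_length: 64 * KIB,
        }
    }
}

#[derive(Debug, PartialEq, Eq)]
pub enum LimitError {
    NestingTooDeep,
    TooManyItems, // hypothetical name
}

/// Pre-allocation cap from the "Memory exhaustion" row.
pub const PREALLOC_CAP: usize = 1024;

/// Rejects a container that would exceed the configured depth.
pub fn check_depth(limits: &Limits, depth: usize) -> Result<(), LimitError> {
    if depth > limits.max_nesting_depth {
        Err(LimitError::NestingTooDeep)
    } else {
        Ok(())
    }
}

/// Validates a declared element count, then reserves at most
/// `PREALLOC_CAP` slots so a tiny input cannot force a huge allocation.
pub fn alloc_items<T>(limits: &Limits, declared: usize) -> Result<Vec<T>, LimitError> {
    if declared > limits.max_items {
        return Err(LimitError::TooManyItems);
    }
    Ok(Vec::with_capacity(declared.min(PREALLOC_CAP)))
}
```

The vector still grows on demand as real elements are decoded, so the cap only limits what an attacker can make the decoder reserve before any data has been validated.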

Safe Rust Policy

The core encoder/decoder (surp-core) uses 100% safe Rust.
The FFI crate (surp-ffi) uses unsafe at the C boundary only, with documented safety contracts.
No unsafe in parsing, varint decoding, or checksum computation.

Fuzzing

Fuzzing targets cover:

Decoder::decode_all_owned() — arbitrary binary input
text::parse() — arbitrary text input
Varint decoding — malformed varint sequences
Block framing — truncated/corrupted blocks

Run fuzzing:

bash

cd fuzz
cargo +nightly fuzz run fuzz_decode -- -max_total_time=3600

Reporting Vulnerabilities

If you discover a security vulnerability, please report it privately via GitHub Security Advisories for tubox-labs/surp.

Do NOT open a public issue for security vulnerabilities.

Security Audit Checklist

Before each release:

Run cargo audit — no known vulnerabilities
Run fuzzing for ≥1 hour with no crashes
Review any new unsafe blocks
Verify all limits are enforced in tests
Check for panics in error paths (should return Result)

Design Risks & Tradeoffs

Format Design Tradeoffs

TLV (Tag-Length-Value) vs Columnar

Choice: TLV with element counts.

Pro: Natural fit for streaming and tree-structured data. Easy to skip unknown fields.
Pro: Simple implementation, well-understood.
Con: Not optimal for analytical workloads (column scans). For analytics, consider Apache Arrow.
Mitigation: Index blocks enable random access within a file. String dictionaries provide some columnar benefits.
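
The "easy to skip unknown fields" property of TLV can be sketched as follows. The record layout here (`[tag: u8][len: u8][value]`) is a deliberately simplified assumption for illustration; Surp's actual wire format uses varint-encoded lengths, and `find_field` is a hypothetical helper.

```rust
#[derive(Debug, PartialEq, Eq)]
pub enum TlvError {
    UnexpectedEof,
}

/// Walks a buffer of `[tag: u8][len: u8][value: len bytes]` records and
/// returns the value of the first record whose tag equals `wanted`.
/// Records with any other tag are skipped using only their length field,
/// so the reader never needs to understand their contents.
pub fn find_field(mut buf: &[u8], wanted: u8) -> Result<Option<&[u8]>, TlvError> {
    while !buf.is_empty() {
        if buf.len() < 2 {
            return Err(TlvError::UnexpectedEof);
        }
        let tag = buf[0];
        let len = buf[1] as usize;
        let rest = &buf[2..];
        // Bounds-check the declared length before slicing.
        if rest.len() < len {
            return Err(TlvError::UnexpectedEof);
        }
        let (value, tail) = rest.split_at(len);
        if tag == wanted {
            return Ok(Some(value));
        }
        buf = tail; // unknown or uninteresting tag: jump over it
    }
    Ok(None)
}
```

A reader built this way keeps working when a newer writer adds tags it does not recognize, which is the forward-compatibility benefit the tradeoff above relies on.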

StartObject/EndObject markers vs Length-Prefixed Objects

Choice: Both — count prefix + end markers.

We write an element count at the start of each object/array (for fast skipping) and an end marker at its close (for streaming validation). This costs about 2 extra bytes per container but provides:

Forward skip without recursive descent (use count to skip elements).
Streaming validation (end markers confirm structure).
Resilience to truncation (missing end marker detected).
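
The three properties above can be sketched with a toy format. The tag values, the one-byte count, and the one-byte integer payload are assumptions chosen for brevity, not Surp's encoding. The skipper is iterative: it tracks the remaining element count of each open container on an explicit stack and requires an end marker once a count reaches zero.

```rust
const START: u8 = 0x01; // container start, followed by a one-byte element count
const END: u8 = 0x02; // container end marker
const INT: u8 = 0x03; // scalar, followed by a one-byte payload

#[derive(Debug, PartialEq, Eq)]
pub enum SkipError {
    UnexpectedEof,
    MissingEndMarker,
    UnknownTag(u8),
    NestingTooDeep,
}

/// Skips one complete value at the start of `buf` and returns its length.
pub fn skip_value(buf: &[u8], max_depth: usize) -> Result<usize, SkipError> {
    let mut pos = 0usize;
    // Remaining element count for every container that is still open.
    let mut open: Vec<usize> = Vec::new();
    loop {
        let tag = *buf.get(pos).ok_or(SkipError::UnexpectedEof)?;
        let mut completed = match tag {
            INT => {
                if buf.len() < pos + 2 {
                    return Err(SkipError::UnexpectedEof);
                }
                pos += 2;
                true
            }
            START => {
                let count = *buf.get(pos + 1).ok_or(SkipError::UnexpectedEof)? as usize;
                if open.len() >= max_depth {
                    return Err(SkipError::NestingTooDeep);
                }
                open.push(count);
                pos += 2;
                false
            }
            other => return Err(SkipError::UnknownTag(other)),
        };
        loop {
            if completed {
                match open.last_mut() {
                    None => return Ok(pos), // the top-level value is finished
                    Some(remaining) => *remaining -= 1,
                }
            }
            if open.last() != Some(&0) {
                break; // the innermost container still expects elements
            }
            // All declared elements were seen: the end marker must follow.
            match buf.get(pos) {
                Some(&END) => pos += 1,
                Some(_) => return Err(SkipError::MissingEndMarker),
                None => return Err(SkipError::UnexpectedEof),
            }
            open.pop();
            completed = true;
        }
    }
}
```

In this sketch a truncated container surfaces as `UnexpectedEof`, while a count that disagrees with the actual contents surfaces as `MissingEndMarker`, so the two redundant pieces of framing cross-check each other.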

Per-Block vs Whole-File Compression

Choice: Per-block.

Pro: Random access preserved. Can decode any block independently.
Pro: Mixed compression strategies possible (e.g., zstd for text, none for binary blobs).
Con: Slightly lower compression ratio than whole-file (no cross-block dictionary).
Mitigation: Block sizes can be large (up to 64 MiB default) to amortize overhead.

XXH64 vs CRC32 vs SHA-256

Choice: XXH64 for both the per-block checksums and the file trailer.

XXH64: ~30 GB/s throughput, 64-bit hash, excellent collision resistance for non-cryptographic use.
CRC32: Weaker collision properties, hardware accelerated but XXH64 is already faster in software.
SHA-256: Cryptographic strength unnecessary for data integrity (we're not preventing tampering, just detecting corruption).
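
The "verify before processing" ordering can be sketched as below. Two things here are assumptions: the frame layout (`[len: u32 LE][checksum: u64 LE][payload]`) is hypothetical, and FNV-1a 64 is swapped in for XXH64 purely so the example needs nothing beyond the standard library. Like XXH64, it detects accidental corruption and offers no protection against deliberate tampering.

```rust
/// FNV-1a 64-bit: a stand-in for XXH64 in this sketch only.
fn checksum(data: &[u8]) -> u64 {
    let mut h: u64 = 0xcbf2_9ce4_8422_2325;
    for &b in data {
        h ^= b as u64;
        h = h.wrapping_mul(0x0000_0100_0000_01b3);
    }
    h
}

#[derive(Debug, PartialEq, Eq)]
pub enum BlockError {
    UnexpectedEof,
    BlockTooLarge,
    ChecksumMismatch,
}

const HEADER_LEN: usize = 12;

/// Builds a frame: 4-byte length, 8-byte checksum, then the payload.
pub fn encode_block(payload: &[u8]) -> Vec<u8> {
    let mut out = Vec::with_capacity(HEADER_LEN + payload.len());
    out.extend_from_slice(&(payload.len() as u32).to_le_bytes());
    out.extend_from_slice(&checksum(payload).to_le_bytes());
    out.extend_from_slice(payload);
    out
}

/// Returns the payload only after the length and checksum have been checked,
/// so no later stage ever parses bytes from a corrupted block.
pub fn open_block(frame: &[u8], max_block_size: usize) -> Result<&[u8], BlockError> {
    let header = frame.get(..HEADER_LEN).ok_or(BlockError::UnexpectedEof)?;
    let mut len_bytes = [0u8; 4];
    len_bytes.copy_from_slice(&header[..4]);
    let mut sum_bytes = [0u8; 8];
    sum_bytes.copy_from_slice(&header[4..]);
    let len = u32::from_le_bytes(len_bytes) as usize;
    let expected = u64::from_le_bytes(sum_bytes);

    // Reject oversized length fields before touching the payload.
    if len > max_block_size {
        return Err(BlockError::BlockTooLarge);
    }
    let payload = frame
        .get(HEADER_LEN..HEADER_LEN + len)
        .ok_or(BlockError::UnexpectedEof)?;
    if checksum(payload) != expected {
        return Err(BlockError::ChecksumMismatch);
    }
    Ok(payload)
}
```

Checking the declared length first keeps the cost of rejecting a hostile frame constant, and checking the checksum before returning the payload means downstream parsing only ever sees bytes that match what the writer produced.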

String Dictionary: Per-Block vs Global

Choice: Per-block string dictionary.

Pro: Each block is self-contained (streamable, seekable).
Pro: Dictionary overhead amortized over typical block sizes.
Con: Repeated strings across blocks are not deduplicated.
Future: Optional global dictionary block (BlockType::StringDict) for files where cross-block dedup matters.
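
A per-block string table can be sketched as a small interner. The type and method names below are hypothetical and the ID width is an assumption; the point is only to show why each block stays self-contained and why strings repeated across blocks are not deduplicated.

```rust
use std::collections::HashMap;

/// Per-block string table: a string gets a small integer ID the first time
/// it is seen, and later occurrences can be written as that ID instead.
#[derive(Default)]
pub struct StringDict {
    ids: HashMap<String, u32>,
    strings: Vec<String>,
}

impl StringDict {
    /// Returns the ID for `s`, assigning the next free ID on first use.
    pub fn intern(&mut self, s: &str) -> u32 {
        if let Some(&id) = self.ids.get(s) {
            return id;
        }
        let id = self.strings.len() as u32;
        self.ids.insert(s.to_owned(), id);
        self.strings.push(s.to_owned());
        id
    }

    /// Looks an ID up on the decoding side. An out-of-range ID yields `None`
    /// rather than a panic, since IDs come from untrusted input.
    pub fn resolve(&self, id: u32) -> Option<&str> {
        self.strings.get(id as usize).map(|s| s.as_str())
    }

    /// Number of distinct strings stored in this block's table.
    pub fn len(&self) -> usize {
        self.strings.len()
    }
}
```

An encoder following this sketch would create a fresh `StringDict` for every block, so a key such as `"name"` is stored once per block it appears in. That repetition is the cost the "Con" above describes, and a global dictionary block would remove it.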

Prioritized Optimization Roadmap

String dictionary: implement a per-block string table to deduplicate repeated keys.
SIMD varint: batch decode using PEXT/PDEP on x86_64.
Arena allocation: pool allocations for Value trees.
Mmap decoder: memory-mapped file support in surp-io.
Columnar mode: optional columnar block layout for analytics workloads.

Hot Functions to Profile

Encoder::encode_value_inner — recursive encoding, measure per-field overhead.
Decoder::decode_value_at — recursive decoding, tag dispatch overhead.
varint::decode_varint — called per field, must be branch-optimal.
compute_xxh64 — called per block, should be negligible.
String::from_utf8 / str::from_utf8 — UTF-8 validation on every string.
