Security & design risks

Reporting policy and the project's own honest list of design tradeoffs.

This page combines the project's SECURITY.md and DESIGN_RISKS.md documents, side by side, so security reporting and known design tradeoffs are reachable from a single URL.


Security Policy

Threat Model

Surp is designed to safely handle untrusted input. The decoder is built to resist adversarial documents including:

Attack Vectors & Mitigations

Attack Mitigation
Oversized length fields All varint-decoded lengths are bounds-checked against configurable Limits. Default max block size: 64 MiB, max string: 16 MiB.
Integer overflow LEB128 decoder rejects varints > 10 bytes. ZigZag decoding uses wrapping arithmetic (no UB).
Deep nesting Configurable max_nesting_depth (default: 128, strict: 32). Exceeded depth returns NestingTooDeep error.
Memory exhaustion Per-session max_memory limit (default: 256 MiB). Array/object pre-allocation capped at 1024 elements.
Item count bomb max_items limit (default: 1M) prevents allocation of enormous arrays/objects from small input.
Recursive references Reference wire type currently decoded as integer ID. Full resolution will validate against a bounded reference table.
Malformed varints Decoder rejects truncated varints (UnexpectedEof) and overlong encodings (VarintOverflow).
Invalid UTF-8 All string fields are validated with std::str::from_utf8. Invalid sequences produce InvalidUtf8 error.
Checksum bypass Per-block XXH64 checksums are verified before payload processing. Corrupted blocks are rejected.
Compression bombs Decompression output is bounded by max_block_size. Snappy provides pre-decompression length check.

Resource Limits (Configurable)

rust
1
2
3
4
5
6
7
8
9
use surp_core::Limits;

// For untrusted network input:
let limits = Limits::strict();
// max_nesting_depth: 32
// max_block_size: 1 MiB
// max_items: 10,000
// max_memory: 4 MiB
// max_string_length: 64 KiB

Safe Rust Policy

  • The core encoder/decoder (surp-core) uses 100% safe Rust.
  • The FFI crate (surp-ffi) uses unsafe at the C boundary only, with documented safety contracts.
  • No unsafe in parsing, varint decoding, or checksum computation.

Fuzzing

Fuzzing targets cover:

  • Decoder::decode_all_owned() — arbitrary binary input
  • text::parse() — arbitrary text input
  • Varint decoding — malformed varint sequences
  • Block framing — truncated/corrupted blocks

Run fuzzing:

bash
1
2
cd fuzz
cargo +nightly fuzz run fuzz_decode -- -max_total_time=3600

Reporting Vulnerabilities

If you discover a security vulnerability, please report it privately via GitHub Security Advisories for tubox-labs/surp.

Do NOT open a public issue for security vulnerabilities.

Security Audit Checklist

Before each release:

  • Run cargo audit — no known vulnerabilities
  • Run fuzzing for ≥1 hour with no crashes
  • Review any new unsafe blocks
  • Verify all limits are enforced in tests
  • Check for panics in error paths (should return Result)

Design Risks & Tradeoffs

Format Design Tradeoffs

TLV (Tag-Length-Value) vs Columnar

Choice: TLV with element counts.

  • Pro: Natural fit for streaming and tree-structured data. Easy to skip unknown fields.
  • Pro: Simple implementation, well-understood.
  • Con: Not optimal for analytical workloads (column scans). For analytics, consider Apache Arrow.
  • Mitigation: Index blocks enable random access within a file. String dictionaries provide some columnar benefits.

StartObject/EndObject markers vs Length-Prefixed Objects

Choice: Both — count prefix + end markers.

We encode element counts at the start of objects/arrays (for fast skipping) AND end markers (for streaming validation). This uses ~2 extra bytes per container but provides:

  • Forward skip without recursive descent (use count to skip elements).
  • Streaming validation (end markers confirm structure).
  • Resilience to truncation (missing end marker detected).

Per-Block vs Whole-File Compression

Choice: Per-block.

  • Pro: Random access preserved. Can decode any block independently.
  • Pro: Mixed compression strategies possible (e.g., zstd for text, none for binary blobs).
  • Con: Slightly lower compression ratio than whole-file (no cross-block dictionary).
  • Mitigation: Block sizes can be large (up to 64 MiB default) to amortize overhead.

XXH64 vs CRC32 vs SHA-256

Choice: XXH64 for per-block, XXH64 for file trailer.

  • XXH64: ~30 GB/s throughput, 64-bit hash, excellent collision resistance for non-cryptographic use.
  • CRC32: Weaker collision properties, hardware accelerated but XXH64 is already faster in software.
  • SHA-256: Cryptographic strength unnecessary for data integrity (we're not preventing tampering, just detecting corruption).

String Dictionary: Per-Block vs Global

Choice: Per-block string dictionary.

  • Pro: Each block is self-contained (streamable, seekable).
  • Pro: Dictionary overhead amortized over typical block sizes.
  • Con: Repeated strings across blocks are not deduplicated.
  • Future: Optional global dictionary block (BlockType::StringDict) for files where cross-block dedup matters.

Implementation Risks

Schema Evolution Complexity

Risk: Complex schema changes (renaming fields, changing types) may lead to subtle data loss.

Mitigation: Field IDs are stable. Unknown fields are skipped, not rejected. Type changes require explicit migration. The SurpSchema derive provides schema_info() for programmatic validation.

Zero-Copy Safety

Risk: SurpValue<'a> borrows from the input buffer. If the buffer is deallocated while SurpValue references exist, UB would occur.

Mitigation: Lifetime parameter 'a prevents use-after-free at compile time. This is standard Rust borrow semantics — no unsafe involved.

SIMD Portability

Risk: SIMD code may not compile or may perform poorly on non-x86 architectures.

Mitigation: SIMD is behind the surp-simd feature flag and is entirely optional. All operations have scalar fallbacks.

Endianness

Risk: The format uses little-endian on wire. Big-endian hosts must byte-swap.

Mitigation: All multi-byte integers use to_le_bytes()/from_le_bytes(), which Rust handles correctly on all platforms. Varints (LEB128) are endian-independent by definition.

Performance Risks

Varint Decode Throughput

Risk: LEB128 decoding is branch-heavy and may bottleneck on large varint arrays.

Mitigation: Most field IDs and lengths are small (<128), fitting in 1 byte (no branch misprediction). SIMD batch decoding is planned for bulk varint workloads.

Allocation Pressure

Risk: Decoding into owned Value types allocates many small strings/vectors.

Mitigation: Zero-copy SurpValue<'a> avoids allocation for string-heavy reads. Schema-bound decode (via #[derive(Surp)]) can decode directly into user structs.

Prioritized Optimization Roadmap

  1. String dictionary — implement per-block string table for repeated key dedup.
  2. SIMD varint — batch decode using PEXT/PDEP on x86_64.
  3. Arena allocation — pool allocations for Value trees.
  4. Mmap decoder — memory-mapped file support in surp-io.
  5. Columnar mode — optional columnar block layout for analytics workloads.

Hot Functions to Profile

  • Encoder::encode_value_inner — recursive encoding, measure per-field overhead.
  • Decoder::decode_value_at — recursive decoding, tag dispatch overhead.
  • varint::decode_varint — called per field, must be branch-optimal.
  • compute_xxh64 — called per block, should be negligible.
  • String::from_utf8 / str::from_utf8 — UTF-8 validation on every string.