Security & design risks
Reporting policy and the project's own honest list of design tradeoffs.
This page combines the project's SECURITY.md and DESIGN_RISKS.md documents,
side by side, so security reporting and known design tradeoffs are reachable
from a single URL.
Security Policy
Threat Model
Surp is designed to safely handle untrusted input. The decoder is built to resist adversarial documents, including the attack vectors listed in the table below.
Attack Vectors & Mitigations
| Attack | Mitigation |
|---|---|
| Oversized length fields | All varint-decoded lengths are bounds-checked against configurable Limits. Default max block size: 64 MiB, max string: 16 MiB. |
| Integer overflow | LEB128 decoder rejects varints > 10 bytes. ZigZag decoding uses wrapping arithmetic (no UB). |
| Deep nesting | Configurable max_nesting_depth (default: 128, strict: 32). Exceeded depth returns NestingTooDeep error. |
| Memory exhaustion | Per-session max_memory limit (default: 256 MiB). Array/object pre-allocation capped at 1024 elements. |
| Item count bomb | max_items limit (default: 1M) prevents allocation of enormous arrays/objects from small input. |
| Recursive references | Reference wire type currently decoded as integer ID. Full resolution will validate against a bounded reference table. |
| Malformed varints | Decoder rejects truncated varints (UnexpectedEof) and overlong encodings (VarintOverflow). |
| Invalid UTF-8 | All string fields are validated with std::str::from_utf8. Invalid sequences produce InvalidUtf8 error. |
| Checksum bypass | Per-block XXH64 checksums are verified before payload processing. Corrupted blocks are rejected. |
| Compression bombs | Decompression output is bounded by max_block_size. Snappy provides pre-decompression length check. |
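
To make the integer overflow and malformed varint rows concrete, here is a minimal sketch of a bounded LEB128 decoder with a ZigZag step. It is not taken from surp-core: the `VarintError` type and both function signatures are assumptions for illustration, and only the error names (UnexpectedEof, VarintOverflow) and the 10-byte bound come from the table. This sketch treats "overflow" as length or bit overflow only and does not reject non-canonical encodings.

```rust
/// Hypothetical error type; the variant names follow the table above.
#[derive(Debug, PartialEq, Eq)]
pub enum VarintError {
    /// Input ended while the continuation bit was still set.
    UnexpectedEof,
    /// The varint would need more than 10 bytes or more than 64 bits.
    VarintOverflow,
}

/// A u64 needs at most ceil(64 / 7) = 10 LEB128 bytes.
const MAX_VARINT_BYTES: usize = 10;

/// Decodes one unsigned LEB128 varint.
/// Returns the value and the number of bytes consumed.
pub fn decode_varint(input: &[u8]) -> Result<(u64, usize), VarintError> {
    let mut value: u64 = 0;
    for (i, &byte) in input.iter().take(MAX_VARINT_BYTES).enumerate() {
        // The 10th byte may only carry bit 63 and must terminate the varint.
        if i == MAX_VARINT_BYTES - 1 && byte > 1 {
            return Err(VarintError::VarintOverflow);
        }
        value |= u64::from(byte & 0x7f) << (7 * i);
        if byte & 0x80 == 0 {
            return Ok((value, i + 1));
        }
    }
    // Fewer than 10 bytes were available and none of them ended the varint.
    Err(VarintError::UnexpectedEof)
}

/// ZigZag decoding: maps 0, 1, 2, 3, ... to 0, -1, 1, -2, ...
/// `wrapping_neg` keeps the arithmetic free of overflow panics.
pub fn zigzag_decode(n: u64) -> i64 {
    ((n >> 1) as i64) ^ ((n & 1) as i64).wrapping_neg()
}
```

Because the loop never reads more than 10 bytes and never shifts by more than 63 bits, no input can trigger a shift overflow or an unbounded scan.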
Resource Limits (Configurable)
use surp_core::Limits;
// For untrusted network input:
let limits = Limits::strict();
// max_nesting_depth: 32
// max_block_size: 1 MiB
// max_items: 10,000
// max_memory: 4 MiB
// max_string_length: 64 KiB
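
The following sketch shows how limits like these can be applied before any allocation happens. `SketchLimits`, `LimitError`, and `checked_string_len` are hypothetical names and do not describe the real `surp_core::Limits` API; the numeric values are the defaults and strict values quoted on this page, with "1M" items read as 1,000,000.

```rust
/// Hypothetical mirror of the limits described on this page.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct SketchLimits {
    pub max_nesting_depth: usize,
    pub max_block_size: usize,
    pub max_items: usize,
    pub max_memory: usize,
    pub max_string_length: usize,
}

const KIB: usize = 1024;
const MIB: usize = 1024 * KIB;

impl SketchLimits {
    /// Default values as documented above ("1M" items assumed to be 1,000,000).
    pub fn default_limits() -> Self {
        SketchLimits {
            max_nesting_depth: 128,
            max_block_size: 64 * MIB,
            max_items: 1_000_000,
            max_memory: 256 * MIB,
            max_string_length: 16 * MIB,
        }
    }

    /// Strict values as documented above, for untrusted network input.
    pub fn strict() -> Self {
        SketchLimits {
            max_nesting_depth: 32,
            max_block_size: MIB,
            max_items: 10_000,
            max_memory: 4 * MIB,
            max_string_length: 64 * KIB,
        }
    }
}

#[derive(Debug, PartialEq, Eq)]
pub enum LimitError {
    StringTooLong,
    UnexpectedEof,
}

/// Validates a varint-decoded string length before allocating anything.
/// `remaining` is the number of input bytes left after the length field.
pub fn checked_string_len(
    limits: &SketchLimits,
    declared: u64,
    remaining: usize,
) -> Result<usize, LimitError> {
    if declared > limits.max_string_length as u64 {
        return Err(LimitError::StringTooLong);
    }
    // Lossless: `declared` is now bounded by a usize value.
    let len = declared as usize;
    if len > remaining {
        return Err(LimitError::UnexpectedEof);
    }
    Ok(len)
}
```

Checking the declared length against both the configured limit and the bytes actually present means a few bytes of input cannot request a multi-gigabyte buffer.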
Safe Rust Policy
- The core encoder/decoder (surp-core) uses 100% safe Rust.
- The FFI crate (surp-ffi) uses unsafe at the C boundary only, with documented safety contracts.
- No unsafe in parsing, varint decoding, or checksum computation.
Fuzzing
Fuzzing targets cover:
- Decoder::decode_all_owned(): arbitrary binary input
- text::parse(): arbitrary text input
- Varint decoding: malformed varint sequences
- Block framing: truncated/corrupted blocks
Run fuzzing:
cd fuzz
cargo +nightly fuzz run fuzz_decode -- -max_total_time=3600
Reporting Vulnerabilities
If you discover a security vulnerability, please report it privately via GitHub Security Advisories for tubox-labs/surp.
Do NOT open a public issue for security vulnerabilities.
Security Audit Checklist
Before each release:
- Run cargo audit: no known vulnerabilities
- Run fuzzing for ≥1 hour with no crashes
- Review any new unsafe blocks
- Verify all limits are enforced in tests
- Check for panics in error paths (should return Result)
Design Risks & Tradeoffs
Format Design Tradeoffs
TLV (Tag-Length-Value) vs Columnar
Choice: TLV with element counts.
- Pro: Natural fit for streaming and tree-structured data. Easy to skip unknown fields.
- Pro: Simple implementation, well-understood.
- Con: Not optimal for analytical workloads (column scans). For analytics, consider Apache Arrow.
- Mitigation: Index blocks enable random access within a file. String dictionaries provide some columnar benefits.
StartObject/EndObject markers vs Length-Prefixed Objects
Choice: Both — count prefix + end markers.
We encode element counts at the start of objects/arrays (for fast skipping) AND end markers (for streaming validation). This uses ~2 extra bytes per container but provides:
- Forward skip without recursive descent (use count to skip elements).
- Streaming validation (end markers confirm structure).
- Resilience to truncation (missing end marker detected).
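
The count-plus-end-marker idea can be sketched with a toy container. The layout below (one start byte, a one-byte count, fixed one-byte items, one end byte) and the marker values are assumptions made purely for illustration; Surp's actual wire format is not shown on this page and will differ.

```rust
// Toy marker values, chosen arbitrarily for this sketch.
const START_ARRAY: u8 = 0xA0;
const END_ARRAY: u8 = 0xA1;

#[derive(Debug, PartialEq, Eq)]
pub enum FrameError {
    BadStart,
    Truncated,
    MissingEndMarker,
}

/// Encodes `items` as: START_ARRAY, count, items..., END_ARRAY.
/// The count and the end marker are the two extra bytes per container.
pub fn encode_array(items: &[u8]) -> Vec<u8> {
    assert!(items.len() <= usize::from(u8::MAX), "toy format: at most 255 items");
    let mut out = Vec::with_capacity(items.len() + 3);
    out.push(START_ARRAY);
    out.push(items.len() as u8);
    out.extend_from_slice(items);
    out.push(END_ARRAY);
    out
}

/// Skips a whole array without inspecting its items.
/// The count gives the jump distance; the end marker confirms the structure.
/// Returns the total number of bytes the array occupies.
pub fn skip_array(buf: &[u8]) -> Result<usize, FrameError> {
    if buf.first() != Some(&START_ARRAY) {
        return Err(FrameError::BadStart);
    }
    let count = usize::from(*buf.get(1).ok_or(FrameError::Truncated)?);
    let end_pos = 2 + count;
    match buf.get(end_pos) {
        Some(&END_ARRAY) => Ok(end_pos + 1),
        Some(_) => Err(FrameError::MissingEndMarker),
        None => Err(FrameError::Truncated),
    }
}
```

In this toy layout a truncated buffer and a corrupted count both surface as errors at the position where the end marker should be, which is the validation benefit the redundancy buys.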
Per-Block vs Whole-File Compression
Choice: Per-block.
- Pro: Random access preserved. Can decode any block independently.
- Pro: Mixed compression strategies possible (e.g., zstd for text, none for binary blobs).
- Con: Slightly lower compression ratio than whole-file (no cross-block dictionary).
- Mitigation: Block sizes can be large (up to 64 MiB default) to amortize overhead.
XXH64 vs CRC32 vs SHA-256
Choice: XXH64 for both per-block checksums and the file trailer.
- XXH64: ~30 GB/s throughput, 64-bit hash, excellent collision resistance for non-cryptographic use.
- CRC32: Weaker collision properties, hardware accelerated but XXH64 is already faster in software.
- SHA-256: Cryptographic strength unnecessary for data integrity (we're not preventing tampering, just detecting corruption).
String Dictionary: Per-Block vs Global
Choice: Per-block string dictionary.
- Pro: Each block is self-contained (streamable, seekable).
- Pro: Dictionary overhead amortized over typical block sizes.
- Con: Repeated strings across blocks are not deduplicated.
- Future: Optional global dictionary block (BlockType::StringDict) for files where cross-block dedup matters.
Implementation Risks
Schema Evolution Complexity
Risk: Complex schema changes (renaming fields, changing types) may lead to subtle data loss.
Mitigation: Field IDs are stable. Unknown fields are skipped, not rejected. Type changes require explicit migration. The SurpSchema derive provides schema_info() for programmatic validation.
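
As an illustration of "unknown fields are skipped, not rejected", here is a small sketch of a reader for a toy tag-length-value layout. The layout (one-byte field ID, one-byte length, then the value), the `UserV1` struct, and the field numbers are all hypothetical and are not Surp's wire format or derive output.

```rust
/// Hypothetical record understood by an older reader.
#[derive(Debug, Default, PartialEq, Eq)]
pub struct UserV1 {
    pub id: Option<u8>,
    pub name: Option<String>,
}

#[derive(Debug, PartialEq, Eq)]
pub enum TlvError {
    Truncated,
    InvalidUtf8,
}

// Stable field IDs: never reused or renumbered across schema versions.
const FIELD_ID: u8 = 1;
const FIELD_NAME: u8 = 2;

/// Decodes a sequence of (field_id, len, value) triples.
/// Fields with unrecognized IDs are stepped over using their length.
pub fn decode_user(mut buf: &[u8]) -> Result<UserV1, TlvError> {
    let mut user = UserV1::default();
    while !buf.is_empty() {
        if buf.len() < 2 {
            return Err(TlvError::Truncated);
        }
        let field_id = buf[0];
        let len = usize::from(buf[1]);
        let value = buf.get(2..2 + len).ok_or(TlvError::Truncated)?;
        match field_id {
            FIELD_ID => user.id = value.first().copied(),
            FIELD_NAME => {
                let s = std::str::from_utf8(value).map_err(|_| TlvError::InvalidUtf8)?;
                user.name = Some(s.to_owned());
            }
            // Written by a newer schema: skip it instead of failing.
            _ => {}
        }
        buf = &buf[2 + len..];
    }
    Ok(user)
}
```

A reader built this way keeps working when a newer writer adds fields, but it cannot detect that an existing field changed type, which is why type changes still need an explicit migration.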
Zero-Copy Safety
Risk: SurpValue<'a> borrows from the input buffer. If the buffer is deallocated while SurpValue references exist, UB would occur.
Mitigation: Lifetime parameter 'a prevents use-after-free at compile time. This is standard Rust borrow semantics — no unsafe involved.
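
A minimal sketch of the borrowing pattern follows. `BorrowedValue` is a hypothetical stand-in and not the real `SurpValue` definition; only `std::str::from_utf8` is a real API here.

```rust
/// Hypothetical stand-in for a zero-copy value tied to its input buffer.
#[derive(Debug, PartialEq, Eq)]
pub enum BorrowedValue<'a> {
    Str(&'a str),
    Bytes(&'a [u8]),
}

/// Validates UTF-8 and returns a view into `buf` without copying.
/// The returned value cannot outlive `buf` because both share lifetime 'a.
pub fn view_str<'a>(buf: &'a [u8]) -> Result<BorrowedValue<'a>, std::str::Utf8Error> {
    std::str::from_utf8(buf).map(BorrowedValue::Str)
}

/// Wraps raw bytes without validation or copying.
pub fn view_bytes<'a>(buf: &'a [u8]) -> BorrowedValue<'a> {
    BorrowedValue::Bytes(buf)
}

// The borrow checker rejects code such as the following, because `buf`
// would be dropped while `v` still borrows from it:
//
//     let v = {
//         let buf = vec![b'h', b'i'];
//         view_str(&buf)
//     };
```

The string view points at the same memory as the input buffer, so the only cost of reading a string is the UTF-8 validation pass.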
SIMD Portability
Risk: SIMD code may not compile or may perform poorly on non-x86 architectures.
Mitigation: SIMD is behind the surp-simd feature flag and is entirely optional. All operations have scalar fallbacks.
Endianness
Risk: The format uses little-endian on wire. Big-endian hosts must byte-swap.
Mitigation: All multi-byte integers use to_le_bytes()/from_le_bytes(), which Rust handles correctly on all platforms. Varints (LEB128) are endian-independent by definition.
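
For illustration, a fixed-width little-endian round trip using the standard library looks like this. The helper names `write_u32_le` and `read_u32_le` are hypothetical; `to_le_bytes` and `from_le_bytes` are the std methods named above.

```rust
/// Appends `value` to `out` in little-endian byte order on every host.
pub fn write_u32_le(out: &mut Vec<u8>, value: u32) {
    out.extend_from_slice(&value.to_le_bytes());
}

/// Reads a little-endian u32 from the front of `input`.
/// Returns None if fewer than 4 bytes are available.
pub fn read_u32_le(input: &[u8]) -> Option<u32> {
    let b = input.get(..4)?;
    Some(u32::from_le_bytes([b[0], b[1], b[2], b[3]]))
}
```

On a little-endian host these conversions compile to plain loads and stores; on a big-endian host the byte swap happens inside the std methods, so calling code stays identical.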
Performance Risks
Varint Decode Throughput
Risk: LEB128 decoding is branch-heavy and may bottleneck on large varint arrays.
Mitigation: Most field IDs and lengths are small (<128) and fit in 1 byte, so the continuation-bit branch is almost always predicted correctly. SIMD batch decoding is planned for bulk varint workloads.
Allocation Pressure
Risk: Decoding into owned Value types allocates many small strings/vectors.
Mitigation: Zero-copy SurpValue<'a> avoids allocation for string-heavy reads. Schema-bound decode (via #[derive(Surp)]) can decode directly into user structs.
Prioritized Optimization Roadmap
- String dictionary: implement per-block string table for repeated key dedup.
- SIMD varint: batch decode using PEXT/PDEP on x86_64.
- Arena allocation: pool allocations for Value trees.
- Mmap decoder: memory-mapped file support in surp-io.
- Columnar mode: optional columnar block layout for analytics workloads.
Hot Functions to Profile
- Encoder::encode_value_inner: recursive encoding, measure per-field overhead.
- Decoder::decode_value_at: recursive decoding, tag dispatch overhead.
- varint::decode_varint: called per field, must be branch-optimal.
- compute_xxh64: called per block, should be negligible.
- String::from_utf8 / str::from_utf8: UTF-8 validation on every string.