Persistence Subsystem

Data integrity is handled by the StateStore (defined in engine/state.py). Operations executing against the filesystem are heavily mitigated against corruption, even against SIGKILL or OS-level interruption, via strict atomic swapping methodologies.

Memory SampleState Serialize state.json.tmp Buffer os.replace Atomic Swap state.json Persistent

Atomic Swaps (atomic_replace)

Instead of writing directly to expected artifacts (e.g., state.json), the persistence layer marshals data into temporary .tmp buffers. Once serialization completes, an os.replace operation forcibly overwrites the destination atomic inode. A retry-backoff mechanism actively intercepts PermissionError locks triggered by host EDR/Antivirus heuristics scanning newly created binaries.

Schema Versioning

All state objects are tagged with an internal STATE_VERSION. If the engine encounters schema drift during deserialization (e.g., reading a v1 JSON under a v2 runtime), it explicitly deprecates the state object rather than inducing unpredictable mutation bugs.

Workspace Hierarchy

work/<pipeline_name>/
โ”œโ”€โ”€ run.json             # Serialized `RunState`: Execution metrics and pipeline metadata
โ”œโ”€โ”€ pipeline.yaml        # Immutable snapshot of the pipeline YAML config during initialization
โ”œโ”€โ”€ run_manifest.json    # Aggregated macro-overview of all samples (used by Starlette API)
โ””โ”€โ”€ samples/
    โ””โ”€โ”€ <sample_id>/     # Isolated sandbox
        โ”œโ”€โ”€ state.json   # Serialized `SampleState`: Complete ledger of `StageOutput` maps
        โ”œโ”€โ”€ context.md   # Synthesized LLM context generated via `engine/context.py`
        โ””โ”€โ”€ ...          # Processor-specific artifacts

The history Ledger

The SampleState.history dictionary strictly maps processor stage names to instances of StageOutput. Output data structures are strictly isolated and namespaced. A downstream processor extracting metrics from an upstream map will query: history["upstream_processor"].data.get("metric") (see Building Custom Processors).

This namespacing permanently prevents field collision across divergent heuristic analysis techniques.