Persistence Subsystem
Data integrity is handled by the StateStore (defined in engine/state.py). Operations executing against the filesystem are heavily mitigated against corruption, even against SIGKILL or OS-level interruption, via strict atomic swapping methodologies.
Atomic Swaps (atomic_replace)
Instead of writing directly to expected artifacts (e.g., state.json), the persistence layer marshals data into temporary .tmp buffers. Once serialization completes, an os.replace operation forcibly overwrites the destination atomic inode. A retry-backoff mechanism actively intercepts PermissionError locks triggered by host EDR/Antivirus heuristics scanning newly created binaries.
Schema Versioning
All state objects are tagged with an internal STATE_VERSION. If the engine encounters schema drift during deserialization (e.g., reading a v1 JSON under a v2 runtime), it explicitly deprecates the state object rather than inducing unpredictable mutation bugs.
Workspace Hierarchy
work/<pipeline_name>/
โโโ run.json # Serialized `RunState`: Execution metrics and pipeline metadata
โโโ pipeline.yaml # Immutable snapshot of the pipeline YAML config during initialization
โโโ run_manifest.json # Aggregated macro-overview of all samples (used by Starlette API)
โโโ samples/
โโโ <sample_id>/ # Isolated sandbox
โโโ state.json # Serialized `SampleState`: Complete ledger of `StageOutput` maps
โโโ context.md # Synthesized LLM context generated via `engine/context.py`
โโโ ... # Processor-specific artifacts
The history Ledger
The SampleState.history dictionary strictly maps processor stage names to instances of StageOutput. Output data structures are strictly isolated and namespaced. A downstream processor extracting metrics from an upstream map will query: history["upstream_processor"].data.get("metric") (see Building Custom Processors).
This namespacing permanently prevents field collision across divergent heuristic analysis techniques.