Pipeline Configuration
A pipeline in DeepZero is a declarative execution graph: an ordered list of stages that transforms a raw dataset (for example, files on disk) into a filtered, analysis-ready one.
The pipeline schema is defined in YAML. DeepZero expands shell-style variable references such as ${VAR:-default} when it loads the configuration.
Configuration Schema
name: my_pipeline
description: Standard vulnerability research pipeline
version: "1.0"
model: openai/gpt-4o  # Default LiteLLM integration target
settings:
  work_dir: work
  max_workers: 8  # Global ceiling on ThreadPoolExecutor threads
stages:
  # Stage 1: MUST be an IngestProcessor
  - name: discover
    processor: file_discovery
    config:
      extensions: ["*"]
  # Stage 2: Synchronous filter
  - name: filter
    processor: metadata_filter
    config:
      require:
        is_executable: true
  # Stage 3: High-latency Map processing
  - name: decompile
    processor: ghidra_decompile/ghidra_decompile.py
    parallel: 4       # At most 4 concurrent Ghidra JVMs
    timeout: 300      # Strict 300-second kill clock per sample, enforced via process.py
    on_failure: skip  # Skip failed samples instead of aborting (options: skip, retry, abort)
    max_retries: 2    # Retry budget; only takes effect when on_failure is retry
    config:
      ghidra_install_dir: ${GHIDRA_INSTALL_DIR}
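The on_failure and max_retries settings on the decompile stage could behave roughly as in the sketch below. This is not DeepZero's implementation: run_with_policy is a hypothetical helper, and raising an error once the retries are exhausted is an assumption made for illustration.

```python
from typing import Callable, Optional, TypeVar

T = TypeVar("T")
R = TypeVar("R")


def run_with_policy(
    fn: Callable[[T], R],
    sample: T,
    on_failure: str = "skip",
    max_retries: int = 0,
) -> Optional[R]:
    """Apply a stage's failure policy to one sample (illustrative only)."""
    if on_failure not in ("skip", "retry", "abort"):
        raise ValueError(f"unknown on_failure policy: {on_failure!r}")

    # Only the retry policy gets extra attempts; skip and abort try once.
    attempts = 1 + (max_retries if on_failure == "retry" else 0)
    last_error: Optional[Exception] = None
    for _ in range(attempts):
        try:
            return fn(sample)
        except Exception as exc:
            last_error = exc

    if on_failure == "skip":
        return None  # drop this sample and keep the pipeline running
    # abort, or retry with all attempts used up (assumed to escalate)
    raise RuntimeError(f"stage failed for {sample!r}") from last_error
```

With on_failure set to skip, as in the example stage, max_retries has no effect in this sketch because only one attempt is made.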
Stage Options
Every stage defined under the stages: array accepts the following attributes:
| Field | Type | Default | Description |
|---|---|---|---|
| name | string | stage_N | Unique stage name within the pipeline |
| processor | string | required | Processor reference (see below) |
| config | dict | {} | Processor-specific configuration |
| parallel | int | 4 | Concurrency for Map processors. 0 auto-scales to os.cpu_count() |
| timeout | int | 0 | Per-sample timeout in seconds (0 = no timeout) |
| on_failure | string | skip | Defines fault-tolerance behavior: skip, retry, or abort |
| max_retries | int | 0 | Retry count when on_failure: retry |
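To make the defaults in the table concrete, here is a minimal sketch of how one parsed stage mapping could be normalized. StageOptions and parse_stage are hypothetical names, not DeepZero's API, and the 1-based numbering of stage_N is an assumption.

```python
import os
from dataclasses import dataclass, field
from typing import Any, Dict


@dataclass
class StageOptions:
    """Hypothetical container mirroring the stage fields in the table."""
    processor: str                      # required, no default
    name: str = ""                      # filled in as stage_N when empty
    config: Dict[str, Any] = field(default_factory=dict)
    parallel: int = 4
    timeout: int = 0                    # 0 = no timeout
    on_failure: str = "skip"
    max_retries: int = 0

    def effective_parallel(self) -> int:
        # 0 auto-scales to the CPU count; fall back to 1 if it is unknown.
        return self.parallel if self.parallel > 0 else (os.cpu_count() or 1)


def parse_stage(raw: Dict[str, Any], index: int) -> StageOptions:
    """Build StageOptions from one parsed YAML mapping, applying defaults."""
    if "processor" not in raw:
        raise ValueError(f"stage {index}: 'processor' is required")
    opts = StageOptions(**raw)  # unknown keys raise TypeError
    if not opts.name:
        opts.name = f"stage_{index}"  # assumed 1-based numbering
    if opts.on_failure not in ("skip", "retry", "abort"):
        raise ValueError(f"stage {opts.name}: bad on_failure {opts.on_failure!r}")
    return opts
```

Calling parse_stage({"processor": "file_discovery"}, 1) yields a stage named stage_1 with every other field at its table default.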
Resolution Logic (engine/pipeline.py)
When invoked via the CLI, the parser attempts to resolve processors in strict hierarchical order:
- Path Resolvers: Direct file paths terminating in .py (e.g., processors/ghidra/ghidra.py:Decompiler).
- Directory Lookup: pipeline/my_pipeline/processors/.
- Internal Registry: System built-ins explicitly registered in stages/__init__.py.
- Dotted Python Import: Dynamically evaluated modules (e.g., my.python.module:MyClass).
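The lookup order above can be sketched as a chain of checks that stops at the first match. This is an illustration rather than the code in engine/pipeline.py: classify_processor_ref is a hypothetical helper, and the matching rules for each step (for instance, how a name maps to a file in the processors directory) are assumptions.

```python
from pathlib import Path
from typing import Iterable


def classify_processor_ref(
    ref: str, processors_dir: Path, builtins: Iterable[str]
) -> str:
    """Return which of the four resolvers would claim `ref` (illustrative)."""
    # Split off an optional ":ClassName" suffix.
    module_part = ref.split(":", 1)[0]

    # 1. Path resolver: the file part ends in .py
    if module_part.endswith(".py"):
        return "path"
    # 2. Directory lookup: a module or package of that name in processors/
    if (processors_dir / f"{ref}.py").is_file() or (processors_dir / ref).is_dir():
        return "directory"
    # 3. Internal registry: names registered as built-ins
    if ref in set(builtins):
        return "builtin"
    # 4. Dotted Python import of the form package.module:ClassName
    if ":" in ref:
        return "import"
    raise LookupError(f"cannot resolve processor reference {ref!r}")
```

Because the checks run in order, a local file in the pipeline's processors directory would shadow a built-in of the same name in this sketch.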
Dynamic Expansion
Before schema validation and processor binding, DeepZero walks the entire parsed YAML tree and replaces every match of the regular expression \$\{([^}]+)\} with the value of the corresponding environment variable. This keeps API keys and installation directories out of committed .yaml pipeline files.
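A minimal sketch of this expansion pass over an already-parsed YAML tree (nested dicts, lists, and scalars) follows. The function name expand is hypothetical, the handling of ${VAR:-default} follows shell semantics, and raising on an unset variable with no default is an assumption; DeepZero may behave differently.

```python
import os
import re
from typing import Any, Mapping, Optional

_VAR = re.compile(r"\$\{([^}]+)\}")


def expand(node: Any, env: Optional[Mapping[str, str]] = None) -> Any:
    """Recursively expand ${VAR} and ${VAR:-default} in a parsed YAML tree."""
    env = os.environ if env is None else env

    def substitute(match: "re.Match[str]") -> str:
        name, sep, default = match.group(1).partition(":-")
        value = env.get(name)
        # Shell semantics for ":-": use the default when unset OR empty.
        if sep and not value:
            return default
        if value is None:
            # Assumption: an unset variable without a default is an error.
            raise KeyError(f"environment variable {name!r} is not set")
        return value

    if isinstance(node, str):
        return _VAR.sub(substitute, node)
    if isinstance(node, dict):
        return {key: expand(val, env) for key, val in node.items()}
    if isinstance(node, list):
        return [expand(item, env) for item in node]
    return node  # ints, bools, None pass through unchanged
```

Passing env explicitly, instead of reading os.environ, makes the pass easy to exercise in isolation with a fixed set of variables.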