The Relay Method: A Formal Protocol for Multi-Session Collaboration with LLM Agents in Production Contexts

1. The Continuity Problem as an Architectural Constraint

Contemporary Large Language Models operate in a perpetual present. Every interaction session begins with no memory of the previous one, every context is deallocated at conversation's end, and every model instance is functionally a new entity with no access to the knowledge, decisions, and artifacts produced by its predecessors. This is not an implementation defect: it is a direct consequence of the transformer architecture (Vaswani et al., 2017), where the model's input is entirely contained within the current context window and no native persistent state mechanism exists between distinct sessions.

For brief, self-contained tasks — a translation, a local refactoring, a factual answer — this constraint is irrelevant. But for complex engineering tasks spanning hours, days, or weeks — developing a distributed software architecture, conducting a security audit on a multi-server infrastructure, iteratively building a framework with cross-dependencies — the absence of continuity becomes the dominant bottleneck. Project state is lost. Architectural decisions made in session N are ignored or contradicted in session N+1. Errors committed and corrected in session N-2 are repeated because the current instance has no access to the error record and its correction. The human researcher finds themselves in the paradoxical condition of having to re-educate their AI collaborator at every session, spending a growing share of their cognitive and computational budget to bring the system back to where it had already arrived.

This problem has been recognized by the research community, but the solutions proposed thus far address only parts of it, and none offers an integrated methodological protocol.

2. Critical Analysis of Existing Solutions

A review of the literature on agentic frameworks reveals five principal approaches to memory and continuity management. Each makes a significant contribution, but each also has structural limitations that reduce its applicability to production contexts where correctness and auditability are primary requirements.

ReAct (Yao et al., 2023), published at ICLR, introduces an elegant paradigm that interleaves reasoning and action, allowing the model to formulate plans, execute actions, and observe results in iterative cycles. However, the framework operates entirely within a single session and provides no mechanism for persisting state across sessions. ReAct's reasoning trace is ephemeral: when the session ends, the record of reasoning and actions is lost.

AutoGPT (Significant Gravitas, 2023) represents the most ambitious attempt at complete agent autonomy, with a planning-execution-evaluation loop operating with minimal human supervision. However, operational experience with AutoGPT has documented a systematic decision drift problem: in the absence of explicit methodological constraints, the agent tends to deviate progressively from its initial objectives, generating irrelevant sub-goals and consuming computational resources on tangential activities. The framework provides no audit mechanism that lets the human supervisor verify after the fact whether the agent's decisions were coherent.

Reflexion (Shinn et al., 2023), published at NeurIPS, introduces a self-reflection mechanism where the agent uses feedback from its own errors to improve subsequent performance. The contribution is significant, but the reflective mechanism operates within a single session and does not survive conversation termination. Lessons learned by the agent in one session are lost in the next — a problem the Relay Method explicitly addresses with the FIELD_NOTES component.

MemGPT (Packer et al., 2023) proposes an architecture inspired by operating system virtual memory, with automatic paging between main context and archival storage. The intuition is powerful (treating context as a pageable resource rather than a monolithic buffer), but the framework treats memory content as flat data without semantic structure. It does not distinguish between facts, decisions, operational lessons, and project state. It provides no validation mechanisms for paged content, and it offers no handover protocol between sessions that guarantees a coherent transition.

Voyager (Wang et al., 2023), developed in a Minecraft environment, introduces a persistent skill library in which the agent accumulates verified programs that can be reused in subsequent sessions. The idea of verified persistence is valuable, but the framework is designed for simulated environments with deterministic state and is not directly transferable to real-world software engineering contexts where project state is distributed across filesystem, databases, services, network configurations, and deployment pipelines.

The gap common to all these frameworks is the absence of a methodological layer between the technical framework and the specific application. The question is not merely what the agent remembers or how it pages its memory, but how the handover is structured, verified, and made auditable. The Relay Method fills exactly this gap.

3. The Three-Layer Persistent Architecture

The Relay Method proposes an architecture founded on three orthogonal persistent layers, each responsible for a specific aspect of operational continuity. The choice of three independent layers, rather than a single monolithic solution, is motivated by the separation of concerns principle: each layer can be implemented, replaced, and maintained independently from the others.

3.1 Knowledge Graph (Baton Relay)

The first layer is a typed knowledge graph serving as the system's long-term semantic memory. The graph comprises typed nodes (people, projects, decisions, operational lessons, concepts, organizations) connected by weighted edges with context. Each node can contain multiple knowledge entries, each with timestamp, source, and generating AI model.

Unlike a document database or a vector store, the knowledge graph maintains structural relationships between knowledge elements. Architectural decision D7 made in session S3 is connected to project X, to person Y who motivated it, and to lesson L12 that confirmed it. This relational network enables contextualized semantic retrieval (implemented through the smart_context tool) producing results qualitatively superior to pure vector retrieval.

The graph is also equipped with native diagnostic tools: find_contradictions identifies mutually incompatible assertions, find_gaps detects isolated nodes and under-represented layers, and find_stale_nodes flags potentially obsolete information. Every mutation can be rolled back individually through the rollback_mutation tool, providing a granular undo mechanism absent in most agent memory systems.
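As an illustration of the data model described in this section, the following is a minimal sketch of a typed graph with timestamped entries, weighted edges, a per-mutation undo log, and an isolated-node check. It is not the Baton Relay implementation: the class names, the methods rollback and isolated_nodes, and the sample entry text are assumptions chosen for the example, loosely mirroring what rollback_mutation and find_gaps are described as doing.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class Entry:
    """One knowledge entry: text plus provenance (hypothetical field names)."""
    text: str
    source: str
    model: str
    timestamp: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )


@dataclass
class Node:
    node_id: str
    node_type: str  # e.g. "decision", "lesson", "project", "person"
    entries: list = field(default_factory=list)


class Graph:
    def __init__(self):
        self.nodes = {}
        self.edges = {}      # (src, dst) -> (weight, context)
        self.mutations = []  # undo callables; list index is the mutation id

    def _log(self, undo):
        self.mutations.append(undo)
        return len(self.mutations) - 1

    def add_node(self, node_id, node_type):
        if node_id in self.nodes:
            raise ValueError(f"node {node_id!r} already exists")
        self.nodes[node_id] = Node(node_id, node_type)
        return self._log(lambda: self.nodes.pop(node_id, None))

    def add_edge(self, src, dst, weight, context):
        self.edges[(src, dst)] = (weight, context)
        return self._log(lambda: self.edges.pop((src, dst), None))

    def add_entry(self, node_id, entry):
        self.nodes[node_id].entries.append(entry)
        return self._log(lambda: self.nodes[node_id].entries.remove(entry))

    def rollback(self, mutation_id):
        """Undo a single mutation; a second rollback of the same id is a no-op."""
        undo = self.mutations[mutation_id]
        if undo is not None:
            undo()
            self.mutations[mutation_id] = None

    def isolated_nodes(self):
        """Return ids of nodes that appear in no edge."""
        linked = {n for pair in self.edges for n in pair}
        return sorted(set(self.nodes) - linked)


g = Graph()
g.add_node("D7", "decision")
g.add_node("X", "project")
g.add_node("L12", "lesson")
g.add_entry("D7", Entry("example decision text", "session S3", "model-a"))
m = g.add_edge("D7", "X", 0.9, "decided in session S3")
print(g.isolated_nodes())  # ['L12']
g.rollback(m)
print(g.isolated_nodes())  # ['D7', 'L12', 'X']
```

The sketch keeps undo actions as closures for brevity; a persistent store would more plausibly record each mutation as data so that the log itself survives the session.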

3.2 Operational Control (MCP Framework)

The second layer is an operational control server exposing the project's entire infrastructure to the LLM agent through the Model Context Protocol (MCP). The layer currently includes 106 tools organized by functional domain: filesystem, shell, git, database (MySQL, PostgreSQL, SQLite), TLS, Docker, nginx, firewall, cron, syntactic validation (PHP, Python), and systemd service management.

The architecturally most significant component of this layer is not the tools themselves — which could be implemented with any function calling framework — but the boot protocol. At each session's start, the mcp_boot function automatically loads into the agent's context: the project's operational manual (current version of operative rules), current state (CURRENT_STATE), field notes (FIELD_NOTES with accumulated lessons), and the previous session's handover. This automatic boot eliminates the "cold session" problem: the agent does not start from zero but from the exact position where the previous session concluded.
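The boot step described above can be sketched as a function that assembles the four documents into one context string. This is a hedged illustration, not the actual mcp_boot: the file names, the directory layout, and the function name build_boot_context are assumptions. The one design point it tries to show is that a missing document is reported explicitly rather than silently skipped.

```python
import tempfile
from pathlib import Path

# Hypothetical file names for the four boot documents.
BOOT_FILES = ["MANUAL.md", "CURRENT_STATE.md", "FIELD_NOTES.md", "HANDOVER.md"]


def build_boot_context(state_dir):
    """Concatenate the boot documents; return (context_text, missing_files)."""
    state_dir = Path(state_dir)
    sections, missing = [], []
    for name in BOOT_FILES:
        path = state_dir / name
        if path.is_file():
            body = path.read_text(encoding="utf-8").strip()
            sections.append(f"=== {name} ===\n{body}")
        else:
            missing.append(name)
    if missing:
        # Surface gaps so the agent does not assume a complete handover.
        sections.append("=== BOOT WARNINGS ===\nmissing: " + ", ".join(missing))
    return "\n\n".join(sections), missing


with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "CURRENT_STATE.md").write_text("S3 done; S4 pending", encoding="utf-8")
    context, missing = build_boot_context(tmp)

print(missing)  # ['MANUAL.md', 'FIELD_NOTES.md', 'HANDOVER.md']
```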

The layer also includes a validator system with blocking power. Validators implement structural checks (project structure coherence, file taxonomy conformity, retention policy validity for destructive operations) and can issue a BLOCK verdict preventing operation execution. The principle is that certain operations — deletions, production configuration modifications, deployments — must not be executable without passing explicit checks, regardless of the "trust" placed in the agent.

3.3 Working State Mirror

The third layer resolves a specific problem of the LLM agent sandbox architecture: the agent's working environment is ephemeral. The sandbox is destroyed at session end, and in some cases even during the session itself — a phenomenon I have experimentally documented as a "compaction event," where the platform summarizes older messages, frees context space, and terminates any processes running in the sandbox.

The Working State Mirror is a synchronization mechanism that periodically replicates artifacts produced by the agent to persistent storage external to the sandbox. Synchronization is based on MD5 comparison (transcript_index) and selective upload of only new or modified files (transcript_save). The mirror operates at regular intervals during the session and mandatorily at session end, ensuring no artifact produced by the agent is lost even in case of abnormal termination.
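A minimal sketch of this kind of digest-based synchronization follows. It assumes the persistent storage is simply another local directory and uses hypothetical function names (md5_index, mirror); it does not reproduce transcript_index or transcript_save. MD5 is used here only for change detection, not as a security measure.

```python
import hashlib
import shutil
import tempfile
from pathlib import Path


def md5_index(root):
    """Map each file's path (relative to root) to its MD5 hex digest."""
    root = Path(root)
    return {
        p.relative_to(root).as_posix(): hashlib.md5(p.read_bytes()).hexdigest()
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def mirror(sandbox, store):
    """Copy only new or modified files from sandbox to store; return their paths."""
    sandbox, store = Path(sandbox), Path(store)
    store.mkdir(parents=True, exist_ok=True)
    remote = md5_index(store)
    uploaded = []
    for rel, digest in md5_index(sandbox).items():
        if remote.get(rel) != digest:
            target = store / rel
            target.parent.mkdir(parents=True, exist_ok=True)
            shutil.copy2(sandbox / rel, target)
            uploaded.append(rel)
    return uploaded


with tempfile.TemporaryDirectory() as tmp:
    box, store = Path(tmp, "sandbox"), Path(tmp, "store")
    box.mkdir()
    (box / "a.py").write_text("print(1)\n")
    (box / "b.py").write_text("print(2)\n")
    first = mirror(box, store)    # everything is new
    (box / "b.py").write_text("print(3)\n")
    second = mirror(box, store)   # only the modified file is copied

print(first, second)  # ['a.py', 'b.py'] ['b.py']
```

Calling mirror on a timer and once more in a termination handler would correspond to the periodic and end-of-session runs described above.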

4. The Formal Protocol

The Relay Method defines a formal protocol composed of three rule categories governing agent behavior and inter-session handover.

4.1 Operational Rules (R1-R7)

Operational rules codify software engineering principles adapted to the human-AI collaboration context. R1 (backup before every modification) mandates that every critical file be copied before modification, with naming conventions including session tag and Unix timestamp. R2 (syntactic verification after every modification) requires that every modified file undergo syntactic verification (php -l, python3 -m py_compile, node --check, JSON parse) immediately after modification, with automatic rollback to backup on failure. R3 (pre-session health check) mandates infrastructure state verification before beginning any operation. R4-R7 cover respectively: mandatory distinction between verified/inferred/unverified state, git commit at every milestone, multi-layer handover, and knowledge graph update.
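Rules R1 and R2 can be combined into a single guarded write. The sketch below covers only Python files and uses the standard py_compile module as the syntax check; the function name safe_write_python and the exact backup naming pattern are assumptions made for the example.

```python
import py_compile
import shutil
import tempfile
import time
from pathlib import Path


def safe_write_python(path, new_source, session_tag):
    """Back up, write, syntax-check, and roll back on failure.

    Returns (ok, backup_path).
    """
    path = Path(path)
    # R1: backup named with session tag and Unix timestamp (hypothetical pattern).
    # Second-resolution timestamps can collide for rapid successive writes.
    backup = path.with_name(f"{path.name}.{session_tag}.{int(time.time())}.bak")
    shutil.copy2(path, backup)
    path.write_text(new_source, encoding="utf-8")
    try:
        # R2: syntax check immediately after the modification.
        py_compile.compile(str(path), doraise=True)
    except py_compile.PyCompileError:
        shutil.copy2(backup, path)  # automatic rollback to the backup
        return False, backup
    return True, backup


with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp, "service.py")
    target.write_text("x = 1\n", encoding="utf-8")
    ok_good, _ = safe_write_python(target, "x = 2\n", "S5")
    ok_bad, _ = safe_write_python(target, "def broken(:\n", "S5")
    final = target.read_text(encoding="utf-8")

print(ok_good, ok_bad, repr(final))  # True False 'x = 2\n'
```

The second write fails the syntax check, so the file is restored to the content it had before that write.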

4.2 Anti-Cliffhanger Constraints (AC1-AC6)

Anti-cliffhanger constraints prevent the most common failure pattern in extended agent sessions: abrupt interruption without handover. AC1 mandates periodic work state checkpoints every N operations. AC2 requires proactive artifact saving to persistent storage before reaching budget limits. AC3 mandates estimating the residual tool call budget before beginning complex operations, to avoid starting tasks that cannot be completed with the remaining resources. AC4-AC6 cover specific edge cases: mid-session compaction management, network timeout recovery, and graceful termination protocol when budget is exhausted.
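AC1 and AC3 lend themselves to a small bookkeeping object. The following sketch is one possible shape, under the assumption that the budget is counted in tool calls and that a fixed number of calls is held back for the final handover; the class name, parameters, and the reserve itself are hypothetical and not part of the protocol text.

```python
class BudgetTracker:
    """Track tool calls, signal checkpoints (AC1), gate new tasks (AC3)."""

    def __init__(self, total_calls, checkpoint_every, handover_reserve):
        self.total_calls = total_calls
        self.checkpoint_every = checkpoint_every    # the "N" in AC1
        self.handover_reserve = handover_reserve    # calls kept for the handover
        self.used = 0

    @property
    def remaining(self):
        return self.total_calls - self.used

    def record_call(self):
        """Count one tool call; return True when a checkpoint is due."""
        if self.remaining <= 0:
            raise RuntimeError("tool call budget exhausted")
        self.used += 1
        return self.used % self.checkpoint_every == 0

    def can_start(self, estimated_calls):
        """True if the task fits in the budget left after the reserve."""
        return self.remaining - self.handover_reserve >= estimated_calls


tracker = BudgetTracker(total_calls=20, checkpoint_every=5, handover_reserve=3)
checkpoints = [n for n in range(1, 13) if tracker.record_call()]
print(checkpoints)           # [5, 10]
print(tracker.remaining)     # 8
print(tracker.can_start(5))  # True
print(tracker.can_start(6))  # False
```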

4.3 Validators (V1-V7)

Validators implement the principle that certain operations must pass formal checks before execution, regardless of the agent's or user's subjective assessment. V1 (validate_structure) verifies project structure coherence. V2 (validate_taxonomy) checks path and content conformity to naming rules. V3 (validate_retention) validates destructive operations against retention policy. V4 (validate_cleanup_plan) requires explicit cleanup plan approval before execution. V5-V7 cover operational channel validation, formal mission closure, and complete sequential validator execution.
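The blocking behavior can be sketched as a list of validator functions whose verdicts are collected before an operation is allowed to run. The validator names validate_retention and validate_cleanup_plan come from the list above, but their bodies here are invented for illustration: the 30-day threshold, the dictionary shape of an operation, and the run_validators helper are all assumptions.

```python
from dataclasses import dataclass
from enum import Enum


class Verdict(Enum):
    PASS = "PASS"
    BLOCK = "BLOCK"


@dataclass
class Result:
    validator: str
    verdict: Verdict
    reason: str = ""


def validate_retention(op, min_age_days=30):
    """Block deletion of files younger than a (hypothetical) retention window."""
    if op["kind"] == "delete" and op["age_days"] < min_age_days:
        reason = f"{op['path']} is younger than {min_age_days} days"
        return Result("validate_retention", Verdict.BLOCK, reason)
    return Result("validate_retention", Verdict.PASS)


def validate_cleanup_plan(op):
    """Block deletions whose cleanup plan has not been explicitly approved."""
    if op["kind"] == "delete" and not op.get("plan_approved", False):
        return Result("validate_cleanup_plan", Verdict.BLOCK, "plan not approved")
    return Result("validate_cleanup_plan", Verdict.PASS)


def run_validators(op, validators):
    """Run every validator in order; the operation is allowed only if all pass."""
    results = [validate(op) for validate in validators]
    allowed = all(r.verdict is Verdict.PASS for r in results)
    return allowed, results


op = {"kind": "delete", "path": "logs/old.log", "age_days": 12, "plan_approved": True}
allowed, results = run_validators(op, [validate_retention, validate_cleanup_plan])
blocked_by = [r.validator for r in results if r.verdict is Verdict.BLOCK]
print(allowed, blocked_by)  # False ['validate_retention']
```

Running all validators instead of stopping at the first BLOCK is a choice made in this sketch so that the supervisor sees every reason at once.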

5. Epistemological Principle: Truth Is Law

The protocol is unified by an epistemological principle formulated as an inviolable rule of collaboration: truth is law.

The agent must never declare unverified completeness. It must never fabricate data to fill informational gaps, never assert "it works" based on deduction when verification is possible, and never conceal relevant warnings to preserve an appearance of success. Every agent assertion must be accompanied by an explicitly stated certainty level: verified (I executed the test and the result is X), inferred (based on evidence Y, I consider Z probable), unverified (I have no elements to evaluate), or blocked (I cannot proceed for reason W).

This principle is not a moral petition. It is an architectural requirement: a multi-session system where session N's assertions are used as premises by session N+1 must ensure the inference chain rests on explicitly qualified foundations. An unverified assertion in session N treated as fact in session N+1 is the informational equivalent of an unchecked null value passed from function to function: a silent error that propagates until it causes a catastrophic failure far from its origin.
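One way to make the four certainty levels machine-checkable at handover is to attach them to each assertion as a type rather than as free text. The sketch below is an assumption about how that could look: the names Certainty, Assertion, and split_handover, the rule that a verified claim must carry a non-empty basis, and the sample claims are all illustrative and not taken from the protocol.

```python
from dataclasses import dataclass
from enum import Enum


class Certainty(Enum):
    VERIFIED = "verified"      # a test was executed and its result observed
    INFERRED = "inferred"      # considered probable on stated evidence
    UNVERIFIED = "unverified"  # no elements to evaluate
    BLOCKED = "blocked"        # could not proceed


@dataclass(frozen=True)
class Assertion:
    claim: str
    certainty: Certainty
    basis: str = ""

    def __post_init__(self):
        # A "verified" label without a recorded check is rejected outright.
        if self.certainty is Certainty.VERIFIED and not self.basis:
            raise ValueError("a verified assertion must state how it was verified")


def split_handover(assertions):
    """Separate claims usable as premises from claims the next session must re-check."""
    premises = [a for a in assertions if a.certainty is Certainty.VERIFIED]
    to_recheck = [a for a in assertions if a.certainty is not Certainty.VERIFIED]
    return premises, to_recheck


handover = [
    Assertion("login endpoint returns 200", Certainty.VERIFIED, "request executed in session N"),
    Assertion("token refresh works", Certainty.INFERRED, "shares the login code path"),
    Assertion("rate limiter is configured", Certainty.UNVERIFIED),
]
premises, to_recheck = split_handover(handover)
print([a.claim for a in premises])    # ['login endpoint returns 200']
print([a.claim for a in to_recheck])  # ['token refresh works', 'rate limiter is configured']
```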

6. Empirical Evidence and Operational Metrics

The Relay Method was developed and validated through a real case study: the construction of MasterBridge, a self-hosted MCP server with multi-provider authentication, AES-256-GCM encrypted credentials with HKDF-SHA256 key derivation, append-only audit log, GitHub PR-style permission system, and OpenAI compatibility via bearer token. The project required 28 hours of development distributed across 7 chronologically linked sessions.

Documented operational data provide a first quantitative benchmark for method evaluation. The 7 sessions produced a total of 29 tool invocations tracked with MD5 round-trip integrity verification, 32 versioned knowledge entries in the graph with timestamp and source, 7 evolutionary super-prompt versions (from v4.7 to v5.1, with documentable diffs), and 21 numbered operational lessons, each derived from a documented failure-and-recovery episode. Budget metrics indicate efficient resource utilization: session S6b consumed 17 tool calls out of 20 available (85%) and session S7 consumed 17 out of 30 (57%), for a mean productive budget utilization of 71% across these two sessions.

Two documented failure episodes are particularly instructive for method evaluation. In session S4, 2,841 lines of code remained trapped in the ephemeral sandbox when the session was interrupted without a checkpoint, a failure that motivated the introduction of anti-cliffhanger constraints AC1 and AC2. In session S7, a cliffhanger on rule R7 (Baton Relay update) caused partial loss of the handover, resulting in decision drift in session S7b. Both episodes, though costly in terms of time, produced lessons codified in the protocol that prevented analogous failures in subsequent sessions. This is an institutional learning mechanism absent in agentic frameworks that do not maintain an explicit record of past errors.

7. Positioning and Future Work

The Relay Method occupies a design space that the literature on LLM agents has thus far underexplored: the methodological level between the technical framework (ReAct, LangChain, AutoGPT) and the specific application (Cursor, GitHub Copilot, Devin). It is not a technical framework — it does not prescribe a language, a platform, or a model architecture. It is an operational protocol implementable atop any framework, with any model, and in any domain.

The method's counterintuitive proposition is that for production tasks where correctness and auditability matter more than speed, methodological discipline surpasses agent autonomy as a predictor of success. A disciplined agent producing verifiable artifacts and auditable handovers is systematically more useful than an autonomous agent producing non-replicable results — even when the autonomous agent is faster in individual sessions.

Future research envisions three directions: a controlled experiment with N≥10 identical task pairs executed with and without the Relay protocol, to quantify the effect on token efficiency, error repeat rate, decision consistency, and task completion rate; open-source release of the complete reference implementation (Baton Relay + MasterBridge + Working State Mirror); and theoretical formalization of the protocol in terms of Byzantine fault tolerance applied to multi-agent systems with shared memory. An academic paper is currently in preparation targeting submission to a peer-reviewed workshop in 2026-2027.

Note on Originality and Intellectual Property

The Relay Method was conceived, developed, and validated entirely by the author from 2024 onward, in the context of real software engineering projects conducted with frontier LLMs. It is documented in Chapter 19 of the book La Macchina che Mente (in press) and has been cited in academic literature, including work from the Harvard Graduate School of Education on human-AI collaboration in software development contexts.

Giuseppe Siciliani
Independent Cybersecurity Researcher & AI Consultant, Milan
Media Lives Cybersecurity Research Lab (MLCSL), Media Lives S.r.l.