
Context Windows Are Not Enough: How We Built Long-Context Memory for Enterprise Pharma Workflows

SwishX Engineering

There is a class of problem in enterprise AI that retrieval-augmented generation cannot solve, and it took us longer than we would like to admit to see it clearly. We kept reaching for RAG because it is well-understood, well-supported, and works well in demos. It works less well in production when your AI needs to reason across a company's full commercial history to take a consequential action.

This is the story of why RAG was the wrong abstraction for what we were building, and what we built instead.

The problem with retrieval as a proxy for memory

RAG solves a real and well-defined problem: an LLM has a finite context window, your knowledge corpus is larger than that window, so you retrieve the most semantically relevant chunks at query time and include them in the prompt. The model reasons over the retrieved context. The output is grounded in your corpus rather than in the model's parametric knowledge. This is genuinely useful for a wide class of applications.

The failure mode emerges when you apply RAG to a problem where what the agent needs is not the most semantically relevant documents from a corpus. It is a synthesised understanding of an organisational relationship that has history, nuance, and structure distributed across dozens of systems and hundreds of events over months or years.

Consider a concrete example from SwishX's Contract IQ module. An agent is generating a hospital rate contract for a large private hospital group. What does it need to do this well? It needs to know the terms of previous contracts with this group and how those negotiations went. It needs to know whether there were any supply performance issues under those contracts and how they were resolved. It needs to know the procurement team's historical sensitivity to specific clause types. It needs to know the strategic priority of this account relative to the company's broader institutional portfolio. It needs to know the current competitive context for this therapy area in this geography.

None of these information requirements are satisfied by retrieving the top-K most semantically similar documents from a contract corpus. The most semantically similar documents are probably previous contracts with the same hospital, which is a start. But the agent does not need to read those contracts in full. It needs to understand what they mean about the relationship. That understanding is a synthesis, not a retrieval.

RAG retrieves. It does not synthesise. That is the boundary.

What organisational memory actually is

The concept we needed was not a better retrieval system. It was organisational memory: a persistent, structured, continuously updated representation of the company's commercial state that agents can query at multiple levels of abstraction.

The analogy that clarified this for us internally was the difference between a filing cabinet and a domain expert colleague. A filing cabinet contains all the documents. A retrieval system is a sophisticated search interface for that filing cabinet. Neither is memory in the sense that matters for agent reasoning. A colleague who has been at the company for three years and has been involved in the relationship with this hospital account for two of those years has memory. They have synthesised the raw events into an understanding that is queryable at the level of abstraction the current task requires: what is our relationship with this hospital like, what are the things we have to be careful about, what is their procurement team's decision-making style.

Building organisational memory for AI agents required us to answer a design question that retrieval architectures do not pose: what is the right representation of the company's commercial state, at what level of abstraction, for the agent to reason over efficiently and accurately?

The four-layer memory architecture

We ended up with four layers, each operating at a different time horizon and level of abstraction. The layers are not a hierarchy where higher is better. They are complementary representations that serve different agent reasoning needs.

Layer 1: Episodic Memory is the foundational layer. It is an append-only, immutable log of commercial events: every contract generated, every HCP engagement, every tender submitted, every channel intervention made, every outcome observed. Events are written once and never modified. Corrections are new events that reference the original event they are correcting. This immutability is a design constraint that comes from regulatory requirements in several of the frameworks we operate under, and it turns out to be a good architectural constraint regardless of regulatory requirements. An immutable event log is an extremely reliable foundation for the higher layers to build on.

Episodic memory is not queried through semantic similarity. Agents query it through structured queries against entity identifiers, time ranges, and event types. When the contract generation agent needs to know the history of contracts with a specific hospital, it issues a structured query against the episodic log for that hospital entity, not a semantic similarity search over contract documents. The precision requirements for episodic memory retrieval are too high for embedding-based retrieval to give acceptable results.
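To make the two properties above concrete, here is a minimal in-memory sketch of an append-only log with correction events and exact-match structured queries. The class and field names (`Event`, `EpisodicLog`, `corrects`) and the sample events are illustrative assumptions, not SwishX's actual schema, and a production log would sit on durable storage rather than a Python list.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import List, Optional


@dataclass(frozen=True)  # frozen: fields cannot be reassigned after creation
class Event:
    event_id: int
    entity_id: str
    event_type: str
    occurred_at: datetime
    payload: str
    corrects: Optional[int] = None  # event_id of the event this one corrects


class EpisodicLog:
    """Append-only log: events can be added and read, never edited or removed."""

    def __init__(self) -> None:
        self._events: List[Event] = []

    def append(self, entity_id: str, event_type: str, occurred_at: datetime,
               payload: str, corrects: Optional[int] = None) -> Event:
        # A correction is a new event that points at an existing one.
        if corrects is not None and not 0 <= corrects < len(self._events):
            raise ValueError("correction must reference an existing event")
        event = Event(len(self._events), entity_id, event_type,
                      occurred_at, payload, corrects)
        self._events.append(event)
        return event

    def query(self, entity_id: str, event_type: Optional[str] = None,
              start: Optional[datetime] = None,
              end: Optional[datetime] = None) -> List[Event]:
        """Exact-match structured query; no embeddings or similarity scores."""
        return [
            e for e in self._events
            if e.entity_id == entity_id
            and (event_type is None or e.event_type == event_type)
            and (start is None or e.occurred_at >= start)
            and (end is None or e.occurred_at <= end)
        ]


# Made-up sample data for illustration only.
log = EpisodicLog()
first = log.append("hospital-42", "contract_generated",
                   datetime(2024, 1, 10), "net 60 payment terms")
log.append("hospital-42", "contract_generated",
           datetime(2024, 1, 12), "net 45 payment terms",
           corrects=first.event_id)
log.append("hospital-7", "tender_submitted",
           datetime(2024, 2, 1), "oncology tender")

history = log.query("hospital-42", event_type="contract_generated",
                    start=datetime(2024, 1, 1))
```

The query returns both the original event and its correction, so a consumer can see what was recorded and what superseded it, without anything having been overwritten.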

Layer 2: Semantic Memory is the synthesised understanding layer. A background synthesis process reads the episodic event log and maintains a continuously updated representation of what the events mean: what is this hospital's negotiation pattern, what HCP engagement approaches have worked in this therapy area, what is the competitive context in this territory. This layer does use embedding-based retrieval because semantic queries over synthesised understanding are genuinely well-matched to semantic similarity search.

The critical architectural decision here is that synthesis happens in the background, not at agent query time. Agents consume pre-synthesised semantic memory. They do not synthesise it themselves during task execution. This matters for two reasons. First, it keeps agent latency predictable. Second, and more importantly, it prevents agents from performing lossy on-demand synthesis under the time pressure of task execution. Background synthesis can be done carefully, with sufficient context and deliberation. On-demand synthesis during agent reasoning is susceptible to the same errors that make RAG insufficient in the first place.

The synthesis process itself is an LLM-based process, which means it introduces its own failure modes. We treat semantic memory as a lossy compressed representation of episodic memory and design the agent reasoning process to fall back to episodic queries when the precision requirements exceed what semantic memory can reliably provide.
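The fallback rule can be sketched as follows. This assumes each semantic entry carries a reliability score and each task states the precision it needs; both the `confidence` field and the `required_precision` parameter are hypothetical, since the article does not say how the threshold is expressed in practice.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class SemanticEntry:
    summary: str
    confidence: float  # hypothetical 0-1 reliability score set by the synthesis process


def recall(
    semantic_lookup: Callable[[str], Optional[SemanticEntry]],
    episodic_lookup: Callable[[str], List[str]],
    entity_id: str,
    required_precision: float,
) -> dict:
    """Prefer the synthesised summary; fall back to raw events when it is not reliable enough."""
    entry = semantic_lookup(entity_id)
    if entry is not None and entry.confidence >= required_precision:
        return {"source": "semantic", "content": [entry.summary]}
    return {"source": "episodic", "content": episodic_lookup(entity_id)}


# Made-up sample data for illustration only.
semantic = {"hospital-42": SemanticEntry("Pushes back on penalty clauses", 0.7)}
episodic = {"hospital-42": ["2023 contract: penalty clause removed",
                            "2024 contract: penalty clause capped"]}


def episodic_events(entity_id: str) -> List[str]:
    return episodic.get(entity_id, [])


drafting = recall(semantic.get, episodic_events, "hospital-42", required_precision=0.5)
exact_terms = recall(semantic.get, episodic_events, "hospital-42", required_precision=0.9)
```

A drafting task that only needs the general pattern is served from the summary, while a task that must quote exact prior terms is routed to the underlying events.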

Layer 3: Working Memory is the task-scoped context layer. It is the information the agent maintains during a single task execution: the goal, the current plan, the actions taken and their outcomes, and the pending decisions. Working memory is ephemeral. It does not persist after the task completes. But before it is discarded, its contents are written to episodic memory as a task execution event, which means the episodic layer captures not just commercial outcomes but the agent's reasoning process that produced those outcomes. This creates a richer training and debugging signal than capturing outcomes alone.

The working memory prompt is assembled dynamically at task initiation from Layer 1 and Layer 2. The assembly algorithm is one of the places we invest the most engineering effort, because the quality of the assembled context is the primary determinant of agent output quality. A naive top-K semantic retrieval assembly and our current relevance-density-optimised assembly produce measurably different outputs on the tasks that matter most. We run regular ablations to validate that the algorithm is improving and to catch regressions.
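The article does not describe the assembly algorithm itself. One plausible reading of "relevance density" is relevance per token under a fixed prompt budget, and the sketch below contrasts that reading with a naive top-K baseline. The `Candidate` fields, the scores, and the greedy strategy are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Candidate:
    text: str
    relevance: float  # hypothetical score from whatever ranker is in use
    tokens: int       # token cost of including this item; assumed to be >= 1


def assemble_top_k(candidates: List[Candidate], budget: int) -> List[Candidate]:
    """Naive baseline: take items by relevance alone until one does not fit."""
    chosen, used = [], 0
    for c in sorted(candidates, key=lambda c: c.relevance, reverse=True):
        if used + c.tokens > budget:
            break
        chosen.append(c)
        used += c.tokens
    return chosen


def assemble_by_density(candidates: List[Candidate], budget: int) -> List[Candidate]:
    """Greedy by relevance per token, skipping items that do not fit the budget."""
    chosen, used = [], 0
    for c in sorted(candidates, key=lambda c: c.relevance / c.tokens, reverse=True):
        if used + c.tokens <= budget:
            chosen.append(c)
            used += c.tokens
    return chosen


# Made-up candidates for illustration only.
candidates = [
    Candidate("full 2023 contract text", 0.9, 900),
    Candidate("negotiation pattern summary", 0.8, 100),
    Candidate("supply issue resolution note", 0.7, 100),
    Candidate("account priority note", 0.6, 100),
]

naive = assemble_top_k(candidates, budget=1000)
dense = assemble_by_density(candidates, budget=1000)
```

With these numbers the baseline spends most of the budget on one long document and fits only one summary beside it, while the density-ordered version fits three short, high-value items. Greedy selection is a heuristic rather than an optimal knapsack solution, which is one reason ablations of the kind mentioned above are useful.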

Layer 4: Procedural Memory is the learned task execution pattern layer. It captures what the agent has learned about how to execute specific task types effectively based on historical outcomes. Which contract structures get accepted faster by which counterparty profiles? Which personalisation approaches produce the highest HCP engagement uplift in which therapy area and geography combinations? Which tender submission approaches correlate with higher technical qualification rates?

Procedural memory is updated through a reinforcement signal derived from task outcomes, on a weekly batch cycle. The deliberate slowness of the update cycle reflects a design choice: procedural memory updates should be driven by statistical patterns across many outcomes, not by individual outcomes that may be anomalous. Rapid procedural memory updates based on sparse outcome signals produce instability in agent behaviour that manifests as inconsistency across similar tasks. Weekly batch updates with outcome volume thresholds before an update is triggered produce stable, improving behaviour.
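A minimal sketch of a thresholded batch update, assuming outcomes arrive as (pattern, reward) pairs and each pattern has a single score. The threshold of 30, the learning rate, and the moving-average update are stand-ins chosen for illustration; the article specifies neither the numbers nor the form of the reinforcement signal.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

MIN_OUTCOMES = 30  # hypothetical volume threshold; the article gives no number


def weekly_update(
    policy: Dict[str, float],
    outcomes: List[Tuple[str, float]],
    learning_rate: float = 0.2,
    min_outcomes: int = MIN_OUTCOMES,
) -> Dict[str, float]:
    """Return a new policy; a pattern's score moves only if enough outcomes were observed."""
    by_pattern = defaultdict(list)
    for pattern, reward in outcomes:
        by_pattern[pattern].append(reward)

    updated = dict(policy)
    for pattern, rewards in by_pattern.items():
        if len(rewards) < min_outcomes:
            continue  # too sparse this week: keep the existing estimate
        mean_reward = sum(rewards) / len(rewards)
        old = policy.get(pattern, mean_reward)
        # Move a fraction of the way towards this week's mean reward.
        updated[pattern] = old + learning_rate * (mean_reward - old)
    return updated


# Made-up outcomes for illustration only.
policy = {"tiered_discount": 0.5, "flat_rebate": 0.5}
outcomes = [("tiered_discount", 1.0)] * 40 + [("flat_rebate", 0.0)] * 3
new_policy = weekly_update(policy, outcomes)
```

Three poor outcomes for `flat_rebate` leave its score untouched, while forty outcomes for `tiered_discount` move its score part of the way towards the observed mean. This sketch drops below-threshold outcomes at the end of the batch; a fuller version would probably carry them into the next cycle.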

The cold start problem

Every enterprise customer starts with zero organisational memory. No episodic events. No synthesised semantic understanding. No learned procedural patterns. The agent has to operate effectively from day one despite having none of the memory that makes its long-run performance possible.

Cold start is a genuinely hard problem, and we have not fully solved it. Our current approach has three components.

The first is prior models: semantic memory representations bootstrapped from our aggregate anonymised understanding across the customer base, calibrated to the customer's industry vertical, company size, and therapy area focus. These priors are intentionally weak and designed to be overridden quickly by the customer's own episodic data as it accumulates. But they are better than a blank slate for the first few weeks of operation.
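One simple way to picture a prior that is "intentionally weak" is shrinkage: treat the prior as if it were worth a small number of pseudo-observations, so that real data outweighs it quickly. The sketch below applies this to a single numeric estimate, for example an expected clause acceptance rate. Reducing a prior to one number, the `prior_strength` parameter, and the sample values are all simplifying assumptions; the article describes the priors as semantic memory representations, not scalars.

```python
from typing import List


def blended_estimate(prior_mean: float, observed: List[float],
                     prior_strength: float = 5.0) -> float:
    """Weighted average of prior and data; the prior counts as `prior_strength` observations."""
    n = len(observed)
    if n == 0:
        return prior_mean  # nothing observed yet: fall back to the prior
    observed_mean = sum(observed) / n
    return (prior_strength * prior_mean + n * observed_mean) / (prior_strength + n)


# Made-up numbers: the prior says 0.6, the customer's own data says 0.2.
day_one = blended_estimate(0.6, [])
after_5 = blended_estimate(0.6, [0.2] * 5)
after_95 = blended_estimate(0.6, [0.2] * 95)
```

With a prior worth five observations, five real events already pull the estimate halfway to the customer's own rate, and after ninety-five events the prior contributes very little.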

The second is document ingestion at onboarding: customers bring historical contract documents, HCP engagement records, and channel performance data that we ingest into the episodic layer at setup. This does not replicate the richness of events generated through live platform usage, but it provides a historical foundation that meaningfully reduces the cold start period.

The third is uncertainty-aware action class downgrade: during the cold start period, the agent's action class authority is automatically reduced. Actions that would normally qualify as Class 2 are treated as Class 3, which requires explicit authorisation. Actions that would normally qualify as Class 3 require additional confirmation steps. As the episodic memory accumulates enough volume to support confident reasoning, the action class authority is gradually restored to its nominal level. The customer sees this as the system becoming more autonomous over time as it learns their organisation, which is an accurate description of what is happening.
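The downgrade rule can be sketched as a small function. This is a simplification under stated assumptions: episodic event count stands in for "sufficient volume", the threshold of 500 is invented, restoration is a single step where the article describes it as gradual, and classes other than 2 and 3 are left unchanged because the article does not mention them.

```python
from dataclasses import dataclass

COLD_START_EVENT_THRESHOLD = 500  # hypothetical; the article gives no number


@dataclass(frozen=True)
class Authority:
    action_class: int
    extra_confirmation: bool


def effective_authority(nominal_class: int, episodic_event_count: int,
                        threshold: int = COLD_START_EVENT_THRESHOLD) -> Authority:
    """Downgrade authority while episodic memory is below the volume threshold."""
    if episodic_event_count >= threshold:
        return Authority(nominal_class, False)  # nominal authority restored
    if nominal_class == 2:
        return Authority(3, False)  # Class 2 is treated as Class 3
    if nominal_class == 3:
        return Authority(3, True)   # Class 3 gains extra confirmation steps
    return Authority(nominal_class, False)  # assumption: other classes unchanged


early_class_2 = effective_authority(2, episodic_event_count=10)
early_class_3 = effective_authority(3, episodic_event_count=10)
mature_class_2 = effective_authority(2, episodic_event_count=1000)
```

A gradual version could use separate thresholds per class or per entity, so that authority returns first where the memory is densest.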

What this changes about how we think about LLM selection

Building this memory architecture changed our perspective on LLM selection in a way we did not anticipate.

Before we had the memory architecture, our LLM selection criteria were dominated by raw capability: which model performs best on the task types we care about. After building the memory architecture, we found that the quality of agent outputs was much less sensitive to the choice of base model than to the quality of the context assembled from the memory layers. A moderately capable model with high-quality, well-assembled memory context consistently outperformed a highly capable model with poorly assembled context.

This has practical implications for how we think about the cost and latency trade-offs in LLM selection. The expensive frontier models are worth the cost for tasks where raw reasoning capability is the bottleneck. For most production tasks in our system, memory context quality is the bottleneck, not model capability. Getting the memory architecture right compounds more than upgrading the model.

This is not a claim that model quality does not matter. It is a claim about where the leverage is. For enterprise commercial AI with good memory architecture, the leverage is in the memory layer. For consumer AI or research applications where the task is genuinely novel at inference time, the leverage is more clearly in the model.

What we are still working on

The memory architecture described here is the current production version. There are several open problems we are actively working on.

The synthesis quality problem: background synthesis of episodic memory into semantic memory is only as good as the synthesis process, which is itself an LLM-based process with its own failure modes. We are building automated evaluation infrastructure to detect semantic memory drift, where synthesised understanding diverges from the underlying episodic record in ways that produce agent errors.

The multi-tenant isolation problem: different customers' organisational memories must be completely isolated, but there are categories of learned patterns that are customer-agnostic and should be shared. Getting this isolation and sharing boundary right is an ongoing design and engineering challenge.

The memory decay problem: not all episodic events are equally relevant indefinitely. A contract negotiation pattern from three years ago may be less relevant than one from three months ago for predicting current counterparty behaviour. We are building decay models for semantic memory that weight recency appropriately without discarding genuinely durable historical patterns.
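As an illustration of the trade-off, here is one common shape for a decay model: exponential decay with a half-life, plus a floor for patterns flagged as durable. This is not the model SwishX is building, which the article does not describe; the half-life of 180 days and the floor value are arbitrary assumptions.

```python
def recency_weight(age_days: float, half_life_days: float = 180.0,
                   floor: float = 0.0) -> float:
    """Weight halves every `half_life_days`, but never drops below `floor`."""
    decayed = 0.5 ** (age_days / half_life_days)
    return max(floor, decayed)


three_months = recency_weight(90)               # recent negotiation pattern
three_years = recency_weight(3 * 365)           # old pattern, ordinary decay
durable = recency_weight(3 * 365, floor=0.3)    # old pattern flagged as durable
```

The floor keeps a genuinely durable pattern from fading to near zero, at the cost of needing some way to decide which patterns deserve it.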

These are hard problems and we do not have clean solutions to all of them. What we do have is an architecture that makes them addressable in principled ways, which is the best starting point available.
