From Tool Calls to an Event-Driven Runtime: The Runtime Architecture of LLM Agents

LLM agents should not be understood as a chain of tool calls; as tasks grow longer and environments become more complex, the runtime needs an event-driven architecture for state, concurrency, observation, and recovery.

Many designs for large language model agents still assume a simple flow: the model proposes a tool call, the tool runs, the result comes back, and the model continues reasoning. This flow treats interaction as a synchronous function call. The input is a set of parameters, and the output is text, an image, or some fixed object. That works for simple queries, but it does not describe how real systems behave.

Real tools often do not return once and finish. A tool may stream logs, produce intermediate artifacts, wait for an external service, or run for a long time without a clear stopping point. Users, likewise, do not make a single request at the beginning and then step back. They keep operating the interface, triggering services, and changing the environment. And an agent’s response is not limited to text. It may be speech, an automation, a reminder, an external service operation, or even a robotic action.

For that reason, an agent runtime should not be understood as a linear chain in which “the model calls tools.” It is better understood as an event-driven control system. The large language model is not the sole main program. It is a planner responsible for interpreting state, updating judgment, and choosing actions. Tools, logs, user behavior, scheduled tasks, and robotic sensors are all event sources. The real problem is not how to make the model call tools more often. It is how to organize continuous, asynchronous, noisy changes into state that can support decisions.

Tool Calls Are Not Single Function Returns

The conventional tool-calling model treats a tool as a short-lived function: a request goes in, a result comes out, and the model continues reasoning. Many tools, however, are really processes.

Code execution keeps printing output. Browser automation moves through loading, waiting, failure, and retry. Model training, data processing, literature review, and service deployment may all take a long time. A robot executing an action also does not produce one final result at once; it keeps changing in an environment and producing feedback.

If the system forces the planner to wait until the tool finishes, the agent loses control over the process. It cannot notice a hang, an error, goal drift, or a new decision opportunity in time. On the other hand, if every line of output wakes the planner and triggers fresh reasoning, the system is dragged down by noise. The context fills with logs, and compute is wasted.

A better approach is to treat tool execution as an observable task. Once the task starts, it continuously produces events: started, progress updated, stage completed, errored, waiting, completed, canceled. The planner should not read every raw event directly. It should receive filtered and compressed state changes through an observation layer.

This structure combines three strategies. First, prefer event-driven updates: when a tool produces meaningful new state, the system should notice immediately. Second, throttle and compress high-frequency output: not every fragment deserves to wake the planner. Third, keep low-frequency polling as a fallback: real systems may lose callbacks, delay logs, or drop connections, and periodic checks prevent silent failure.

So the choice is not between “observe on a timer” and “respond to every update.” The better pattern is event-driven first, filtered by an observation layer, with polling as a fallback.
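The three strategies can be sketched in a few lines of Python. This is a minimal illustration, not a reference design: the class name `ObservationLayer`, the `URGENT` event set, and the two time thresholds are assumptions chosen for the example.

```python
from dataclasses import dataclass, field
from typing import Optional

# Assumed set of event kinds that always wake the planner.
URGENT = {"errored", "waiting", "stage_completed", "completed", "canceled"}

@dataclass
class ObservationLayer:
    min_interval: float = 5.0    # throttle window for routine output (seconds, assumed)
    poll_timeout: float = 30.0   # silence longer than this triggers a fallback poll (assumed)
    last_wake: float = float("-inf")
    last_event: float = 0.0
    buffered: list = field(default_factory=list)

    def on_event(self, kind: str, payload: str, now: float) -> Optional[dict]:
        """Event-driven path: return a compressed update if the planner should wake."""
        self.last_event = now
        self.buffered.append((kind, payload))
        if kind in URGENT or now - self.last_wake >= self.min_interval:
            return self._flush(kind, now)
        return None  # throttled: keep buffering, do not wake the planner

    def needs_poll(self, now: float) -> bool:
        """Fallback path: the task has been silent too long, so query its state directly."""
        return now - self.last_event >= self.poll_timeout

    def _flush(self, kind: str, now: float) -> dict:
        # Compress the buffered output into a count plus the latest payload.
        update = {"trigger": kind, "events": len(self.buffered), "latest": self.buffered[-1][1]}
        self.buffered.clear()
        self.last_wake = now
        return update
```

Routine progress events inside the throttle window are buffered silently, an error flushes the buffer at once, and `needs_poll` covers the case where callbacks are lost and nothing arrives at all.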

An Observer Is Not a Smaller Planner

This architecture needs an observer. The observer may be implemented with rules, a smaller model, or a language model in more complex cases, but it should not be designed as a free-form miniature planner. Its responsibility should stay narrow: decide whether the task is still running, whether an error has appeared, whether a stage boundary has been reached, whether the planner needs to intervene, and how to compress continuous output into structured state.

The planner is responsible for goals, strategy, and replanning. The observer is responsible for sensing, monitoring, alerting, and summarizing. The clearer this boundary is, the more stable the system becomes.

A capable observer should not pass large logs to the planner verbatim. It should produce a state packet that answers questions such as: what stage is the task in; what has changed since the last report; is there an anomaly; does the planner need to decide; if so, what candidate actions exist; and what minimal evidence supports that judgment.
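One possible shape for such a state packet is shown below. The field names and the evidence cap are illustrative assumptions; the point is only that each question in the paragraph above maps to one explicit field.

```python
import json
from dataclasses import dataclass, field, asdict
from typing import Optional

MAX_EVIDENCE_LINES = 3  # assumed cap that keeps the evidence "minimal"

@dataclass
class StatePacket:
    task_id: str
    stage: str                       # what stage is the task in
    changes: list                    # what changed since the last report
    anomaly: Optional[str] = None    # is there an anomaly
    needs_decision: bool = False     # does the planner need to decide
    candidate_actions: list = field(default_factory=list)
    evidence: list = field(default_factory=list)

    def __post_init__(self):
        # Keep only the most recent lines; the raw stream is stored elsewhere.
        self.evidence = self.evidence[-MAX_EVIDENCE_LINES:]

    def to_json(self) -> str:
        """Serialize the packet for inclusion in the planner's context."""
        return json.dumps(asdict(self), indent=2)
```

An observer watching a hypothetical build task might emit `StatePacket("build-7", "install", ["resolved 42 deps"], anomaly="pip timeout", needs_decision=True, candidate_actions=["retry", "switch_mirror"], evidence=log_tail)`, and the planner would see a few hundred characters of JSON instead of the full log.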

The raw stream should still be kept for audit and debugging. But in normal operation, the planner should consume structured state and semantic summaries, not the raw stream itself. Only when deeper diagnosis is needed should the system go back to the original records.

There is a useful analogy with human perception. When walking or driving, a person can think about other things while relying on vision and low-level reactions to avoid obstacles. Higher-level cognition does not process every visual frame. Only important changes enter attention. Agent systems should work in a similar way: lower layers continuously sense and quickly filter, while higher layers handle changes that matter to the goal.

But the analogy is only a guide. It cannot replace engineering design. Human low-level reactions are supported by a large body of prior knowledge built up over time. An agent system must explicitly define what counts as an anomaly, what counts as progress, and what should be escalated. The lower the layer, the more it should rely on clear rules, state machines, format checks, and thresholds. Open-ended reasoning belongs higher up.

For long-running tasks, the key question is not how long to wait. It is when to decide again. A running task can usually be grouped into four kinds of state.

The first is normal progress. The task is producing output, making progress, and showing no anomaly. The planner should not intervene frequently here. The observer only needs to maintain heartbeat and stage summaries.

The second is a local update. A tool produces new logs, an intermediate image, or a partial result, but the information is not enough to change the plan. The system updates task state without entering full reasoning.

The third is a decision point. The task reaches a branch: multiple results appear, extra parameters are required, a new constraint is discovered, a checkpoint is reached, or the current evidence is insufficient for automatic progress. The observer should compress the relevant context and hand it to the planner, which then decides the next step.

The fourth is an anomaly point. The task is stuck, times out, fails repeatedly, consumes unusual resources, drifts away from the goal, or creates a safety risk. The system may need to interrupt, retry, roll back, switch tools, ask the user for confirmation, or terminate the task.

The core of long-task management is turning a continuous process into a small number of decision-worthy nodes. Without that discretization, the planner either waits blindly or gets interrupted by low-value signals.
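The four kinds of state lend themselves to a small rule-based classifier. The sketch below assumes the observer has already reduced raw output to a dictionary of signals; the signal names and the failure threshold of three are hypothetical.

```python
from enum import Enum

class TaskState(Enum):
    NORMAL_PROGRESS = "normal_progress"
    LOCAL_UPDATE = "local_update"
    DECISION_POINT = "decision_point"
    ANOMALY_POINT = "anomaly_point"

def classify(signal: dict) -> TaskState:
    """Map observer signals to one of the four states, most severe first."""
    if (signal.get("timed_out")
            or signal.get("safety_risk")
            or signal.get("consecutive_failures", 0) >= 3):  # assumed threshold
        return TaskState.ANOMALY_POINT
    if (signal.get("needs_input")
            or signal.get("checkpoint")
            or signal.get("branches", 0) > 1):
        return TaskState.DECISION_POINT
    if signal.get("new_artifact") or signal.get("partial_result"):
        return TaskState.LOCAL_UPDATE
    return TaskState.NORMAL_PROGRESS

def wakes_planner(state: TaskState) -> bool:
    """Only decision and anomaly points justify a full planning step."""
    return state in {TaskState.DECISION_POINT, TaskState.ANOMALY_POINT}
```

Checking anomalies before decision points means a task that reaches a checkpoint while timing out is treated as an anomaly, which is the conservative ordering.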

Observation frequency should not be fixed either. When a task starts, changes are often frequent, so observation can be denser. During stable execution, the rate can drop. Near a timeout, around anomaly signs, or close to an important checkpoint, attention can rise again. This risk- and change-based observation pattern fits real systems better than simple periodic polling.
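As a rough sketch of risk- and change-based observation, the function below picks the delay before the next check. The specific fractions (first 10% and last 20% of the timeout budget) and the halving rule for anomaly signs are assumptions for illustration, not tuned values.

```python
def next_check_interval(elapsed: float, timeout: float, anomaly_signs: int = 0,
                        near_checkpoint: bool = False,
                        base: float = 30.0, floor: float = 2.0) -> float:
    """Return seconds until the next observation; denser when risk or change is higher."""
    interval = base
    # Startup and the approach to the timeout both get denser observation.
    if elapsed < 0.1 * timeout or elapsed > 0.8 * timeout:
        interval = base / 4
    if near_checkpoint:
        interval = min(interval, base / 2)
    if anomaly_signs > 0:
        # Each additional anomaly sign halves the interval again.
        interval = min(interval, base / 2 ** anomaly_signs)
    return max(floor, interval)
```

With a 600-second budget, this checks every 7.5 seconds during startup, every 30 seconds in steady state, and every 7.5 seconds again once 80% of the budget is spent.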

Logs from Long-Running Services Should Be Sliced by User Action

Another important case is starting a service that keeps running. The service continuously produces logs, and some of those logs correspond to user actions. After a user acts, the planner needs to understand the relevant logs, form a judgment, and plan what comes next.

The core problem is not “let the model read logs.” It is to build a causal link among the user action, service state, log fragments, and later planning. Without that link, the planner sees many logs but cannot tell which ones were caused by the current user action and which ones are just background noise.

Each user action should therefore create an action record. The record should include an action ID, time, user intent, target service, expected impact, and any related request ID, session ID, or trace ID. The system then creates an observation round around that action.

An observation round has clear boundaries. It does not watch the whole log stream indefinitely. After the user action, it captures logs from the relevant time window and relevant components. For example, after the user clicks “deploy,” the system should focus on the deployment service, scheduler, dependency fetches, and health checks, not every background heartbeat. The round can end when a success or failure signal appears, when the waiting limit is exceeded, or when no relevant new information appears for a while.

In this way, a continuous log stream is cut into small fragments related to user actions. The planner no longer faces an unbounded ocean of logs. It receives a local round with a starting point, evidence, and current state.
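A minimal sketch of an action record and its observation round follows. It assumes log entries arrive as dictionaries with `ts` and `component` keys and optional `trace_id`, `level`, and `event` keys; these names, the `healthy` success signal, and the time limits are all hypothetical.

```python
import uuid
from dataclasses import dataclass, field

@dataclass
class ActionRecord:
    intent: str
    target_service: str
    timestamp: float
    trace_id: str
    components: frozenset  # components expected to react to this action
    action_id: str = field(default_factory=lambda: uuid.uuid4().hex)

@dataclass
class ObservationRound:
    action: ActionRecord
    max_wait: float = 120.0    # waiting limit for the whole round (assumed)
    quiet_limit: float = 20.0  # end the round after this much silence (assumed)
    evidence: list = field(default_factory=list)
    outcome: str = "in_progress"

    def offer(self, log: dict) -> bool:
        """Keep a log entry only if it belongs to this round; return True if kept."""
        if self.outcome != "in_progress":
            return False
        start = self.action.timestamp
        in_window = start <= log["ts"] <= start + self.max_wait
        related = (log.get("trace_id") == self.action.trace_id
                   or log["component"] in self.action.components)
        if not (in_window and related):
            return False  # background noise for this action
        self.evidence.append(log)
        if log.get("level") == "ERROR":
            self.outcome = "failure"
        elif log.get("event") == "healthy":
            self.outcome = "success"
        return True

    def tick(self, now: float) -> str:
        """Close the round on timeout or prolonged silence."""
        if self.outcome == "in_progress":
            last = self.evidence[-1]["ts"] if self.evidence else self.action.timestamp
            if now - self.action.timestamp > self.max_wait or now - last > self.quiet_limit:
                self.outcome = "uncertain"
        return self.outcome
```

For a deploy action scoped to the deployment service, scheduler, and health checks, a metrics heartbeat is rejected by `offer`, while a scheduler line and a health-check success are kept as evidence and close the round.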

The observer has four responsibilities here: select relevant logs, decide when observation should end, compress the evidence, and decide whether to escalate to the planner. It should not output long logs. It should output conclusions: whether the user action took effect, whether the current state is success, failure, in progress, or uncertain, where the anomaly most likely occurred, what evidence is still missing, and what actions are available next.

Logs are not the only source of truth. They may be incomplete, delayed, out of order, or show success even though the user sees no effect. A robust system should cross-check logs with metrics, state queries, UI observation, database reads, or API results. A more reliable chain is: user action triggers an observation round, the system gathers logs and other evidence, forms a state hypothesis, verifies the key assumptions, and then lets the planner decide the next step.
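The verification step might look like the following sketch, which compares what the logs claim against independent checks. The function name, the source names, and the `conflict` status are assumptions made for the example.

```python
def verify_hypothesis(log_says_success: bool, checks: dict) -> dict:
    """Cross-check the log-derived hypothesis against other evidence sources.

    `checks` maps a source name to True (confirms success), False (shows failure),
    or None (source unavailable).
    """
    available = {name: ok for name, ok in checks.items() if ok is not None}
    contradicting = sorted(name for name, ok in available.items() if ok != log_says_success)
    missing = sorted(name for name, ok in checks.items() if ok is None)
    if not available:
        status = "uncertain"   # nothing independent to confirm the logs
    elif contradicting:
        status = "conflict"    # logs and at least one other source disagree
    else:
        status = "success" if log_says_success else "failure"
    return {"status": status, "contradicting": contradicting, "missing": missing}
```

If the logs report success but a UI check does not show the change, the result is `conflict` with `ui` listed as the contradicting source, which is exactly the case that should be escalated rather than silently trusted.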

Beyond action-triggered observation, the system also needs background health monitoring. Some problems are not caused directly by a single user action but accumulate over time: queue buildup, resource exhaustion, session expiration, or failed background jobs. Action observers explain “what just happened.” Background observers warn “what may happen next.” Together, they help the agent not only react to explicit operations but also maintain an ongoing understanding of its environment.

Responding to the User Is More Than Producing Text

An agent’s response should not be understood as “reply with a sentence.” Text is only one kind of action. The system may need to explain something to the user, speak aloud, generate a report, call an external service, create a scheduled task, place an order, control a device, or direct a robot.

A more unified design is to model every response as an action. An action may be a message, speech, an API call, a one-off task, a scheduled task, a continuous behavior, or a robotic motion. What these have in common is that the system chooses a way to change the user’s understanding, the state of software, or the physical world based on the current goal and environment.

Informational responses change what the user understands: raw text, formatted text, tables, reports, or progress updates. Speech can be treated as a presentation form of an informational response: the model decides what to say, and the system turns it into audio. Software operations change the state of external systems, such as sending email, editing a document, calling a service, generating material, or running a script. Physical behavior goes further and changes the real environment, such as moving a robot, picking up an object, or controlling a device.

The planner’s output should therefore not be a free-form paragraph, but a set of action plans. A single plan may include several actions: generate a literature summary, send it to the user, and create a nightly reading reminder. They belong to the same goal, but different executors carry them out.

This also requires permission and risk control. Summarizing papers can usually run automatically. Creating a reminder may require light confirmation. Placing an order, making a payment, deleting data, or controlling a device should require explicit authorization. The more an agent can change the world, the clearer its approval policy must be. Otherwise, the system may look powerful but become untrustworthy.

A safer division of labor is this: the planner decides the semantics of the action; the policy layer decides whether it is allowed; the execution layer turns the action into concrete operations; and the observation layer feeds the result back to the planner.
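This division of labor can be sketched as a plan that passes through a policy check before any executor runs. The action kinds, the risk table, and the decision strings below are illustrative assumptions; a real approval policy would be considerably richer.

```python
from dataclasses import dataclass
from enum import Enum

class Risk(Enum):
    LOW = 1
    MEDIUM = 2
    HIGH = 3

@dataclass
class Action:
    kind: str
    params: dict

# Assumed risk table; unknown kinds are treated as high risk.
RISK_BY_KIND = {
    "report": Risk.LOW, "message": Risk.LOW,
    "reminder": Risk.MEDIUM,
    "payment": Risk.HIGH, "delete_data": Risk.HIGH, "device_control": Risk.HIGH,
}

def policy_decision(action: Action, approvals: set) -> str:
    """Policy layer: decide whether an action may run, given the user's approvals."""
    risk = RISK_BY_KIND.get(action.kind, Risk.HIGH)
    if risk is Risk.LOW or action.kind in approvals:
        return "allow"
    return "confirm" if risk is Risk.MEDIUM else "require_authorization"

def run_plan(plan: list, approvals: set, executors: dict) -> list:
    """Execution layer: run allowed actions, report the rest back for observation."""
    results = []
    for action in plan:
        decision = policy_decision(action, approvals)
        if decision == "allow":
            outcome = executors[action.kind](action.params)
            results.append((action.kind, "executed", outcome))
        else:
            results.append((action.kind, decision, None))
    return results
```

For the literature example, the summary and the message run immediately, while the nightly reminder comes back as `confirm` until the user approves it. The returned list is what the observation layer would feed back to the planner.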

In Implementation Terms, the Runtime Looks More Like an Event-Driven Concurrent System

This runtime is closer to an event-driven concurrent system than to a synchronous call chain. Users, tools, observers, schedulers, and executors can all produce and consume events. The large language model sits above them as a planner: it interprets state, adjusts goals, and chooses the next action when important events arrive.

Here, “event-driven concurrent system” is only a structural summary of the runtime. It explains why tools, observers, schedulers, and executors should not be squeezed into one synchronous call chain. The further question is: in such a system, who initiates, who waits, and who yields control? That question needs the lens of communicating processes and coroutines.

In other words, the event-driven runtime answers how the system should be composed. Communicating processes and coroutines explain how the system runs. The first is an architectural shift; the second is an explanatory model.
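The structural claim of this section can be made concrete with a toy event loop built on Python's `asyncio`. The coroutine names, event dictionaries, and the rule that only non-progress events reach the planner are assumptions for the sketch; the planner here merely records what it was asked to handle.

```python
import asyncio

async def tool(events: asyncio.Queue) -> None:
    """A long-running tool that emits progress and then a completion event."""
    for pct in (10, 50, 90):
        await events.put({"source": "tool", "kind": "progress", "value": pct})
        await asyncio.sleep(0)  # yield so other coroutines can interleave
    await events.put({"source": "tool", "kind": "completed", "value": 100})

async def user(events: asyncio.Queue) -> None:
    """The user is an event source too, not only the initial caller."""
    await events.put({"source": "user", "kind": "request", "value": "status?"})

async def observer(events: asyncio.Queue, decisions: asyncio.Queue) -> None:
    """Filter raw events; forward only those that deserve the planner's attention."""
    while (event := await events.get()) is not None:
        if event["kind"] != "progress":
            await decisions.put(event)
    await decisions.put(None)  # propagate shutdown

async def planner(decisions: asyncio.Queue) -> list:
    """Stand-in for the LLM planner: wakes only on forwarded events."""
    handled = []
    while (event := await decisions.get()) is not None:
        handled.append(f'{event["source"]}:{event["kind"]}')
    return handled

async def main() -> list:
    events, decisions = asyncio.Queue(), asyncio.Queue()
    consumers = asyncio.gather(observer(events, decisions), planner(decisions))
    await asyncio.gather(tool(events), user(events))  # producers run concurrently
    await events.put(None)                            # signal that producers are done
    _, handled = await consumers
    return handled
```

Running `asyncio.run(main())` shows the planner handling the completion event and the user request while never seeing the three progress events. No component calls another directly; each one only puts to or gets from a queue.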

A Unified Runtime Structure

Putting these pieces together gives a more complete picture of an LLM agent runtime.

At the bottom are the environment and tools. These include external services, logging systems, document systems, web pages, robots, databases, and automation interfaces. They continuously produce events and receive action requests.

The middle layer contains executors and observers. Executors turn action plans into concrete operations. Observers extract meaningful changes from logs, tool output, state queries, and environmental feedback. Schedulers handle one-off and periodic tasks. Together, these components maintain the running state of the system.

At the top is the planner. It does not watch every raw output and does not handle low-level monitoring. It receives compressed observations, maintains a judgment about goals and environment, and chooses the next action. It intervenes only when the goal is affected, the plan branches, an anomaly needs handling, or the value of new information rises.

The user is both a source of input and an object affected by actions. The agent may explain current state to the user, ask for confirmation, report progress, show results, speak aloud, or perform authorized external operations. The user is not outside the system as a passive observer. The user is part of the interaction environment.

This structure can be summarized as a two-loop system. The inner loop is fast sensing and execution: continuous observation, error detection, task progress, and low-risk reaction. The outer loop is slower cognition and planning: goal interpretation, strategy adjustment, complex judgment, and high-risk decisions. The inner loop keeps the system stable. The outer loop keeps it oriented.
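A compact way to picture the two loops is a fast function that absorbs low-risk events and a slow one that sees only escalations. The observation kinds, the retry limit, and the `replan` callback are hypothetical; in practice the outer loop would call the planner model.

```python
def inner_loop(observations: list, retry_limit: int = 2) -> list:
    """Fast loop: absorb heartbeats and transient errors, escalate everything else."""
    retries, escalations = {}, []
    for obs in observations:
        if obs["kind"] == "heartbeat":
            continue  # normal progress, nothing to do
        if obs["kind"] == "transient_error":
            retries[obs["task"]] = retries.get(obs["task"], 0) + 1
            if retries[obs["task"]] <= retry_limit:
                continue  # low-risk reaction: retry locally without the planner
        escalations.append(obs)
    return escalations

def outer_loop(escalations: list, replan) -> list:
    """Slow loop: one planning step per escalated observation."""
    return [replan(obs) for obs in escalations]
```

Given a heartbeat, three transient errors on the same task, and one decision point, the inner loop retries twice on its own and escalates only the third error and the decision point, so the outer loop runs twice rather than five times.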

Conclusion

The key to LLM agents is not simply making the model better at calling tools. It is redesigning the relationship among tools, observation, action, and planning.

Tools should not be modeled as synchronous functions. They should be modeled as observable tasks. Logs should not be pushed directly into context. They should form observation rounds and evidence packets around user actions. Responses should not be treated only as text. They should be modeled as actions. The large language model should not hold all control. It should operate as a planner inside an event system made of observers, executors, schedulers, and the external environment.

The significance of this architecture is that it accepts the continuity and uncertainty of real interaction. The agent does not face a static problem with one input and one output. It faces a changing environment. It must observe, judge, act, and observe again. Only when the system can organize continuous events into decision-ready state, and unify many output forms into controllable actions, does an LLM agent truly move from chat program to runtime system.

The event-driven runtime answers how an agent system should be organized. The next question is: in such a system, who is calling whom? That question needs to be rethought through communicating processes and coroutines.