To understand what LLM Agents are, let's first explore the basic capabilities of an LLM. Traditionally, an LLM does nothing more than next-token prediction.
By sampling many tokens in a row, we can mimic conversations and use the LLM to give more extensive answers to our queries.
One of the main disadvantages of LLMs is that they do not remember conversations.
LLMs also often fail at basic math like multiplication and division.
Through external systems, the capabilities of the LLM can be enhanced. Anthropic calls this The Augmented LLM.
For instance, when faced with a math question, the LLM may decide to use the appropriate tool, such as a calculator.
"An agent Entity that can perceive and act upon its environment is anything that can be viewed as perceiving its environment through sensors Components used to observe the environment and acting upon that environment through actuators Components used to interact with the environment."
— Russell & Norvig, AI: A Modern Approach (2016)
Agents interact with their environment and typically consist of several important components:
We can generalize this framework to make it suitable for the Augmented LLM:
Using the Augmented LLM, the Agent can observe the environment through textual input (as LLMs are generally textual models) and perform certain actions through its use of tools (like searching the web).
This planning behavior allows the Agent to understand the situation (LLM), plan next steps (planning), take actions (tools), and keep track of the taken actions (memory).
LLMs are forgetful systems; more accurately, they do not perform any memorization at all while you interact with them. For instance, when you ask an LLM a question and then follow it up with another question, it will not remember the former.
We typically refer to this as short-term memory, also called working memory, which functions as a buffer for the (near-) immediate context. This includes recent actions the LLM Agent has taken.
The LLM Agent also needs to keep track of more than just the most recent actions. This is referred to as long-term memory, as the Agent could theoretically take dozens or even hundreds of steps that all need to be memorized.
The most straightforward method for enabling short-term memory is to use the model's context window, which is essentially the number of tokens an LLM can process at once.
This works as long as the conversation history fits within the LLM's context window and is a nice way of mimicking memory. However, instead of actually memorizing a conversation, we essentially tell the LLM what that conversation was.
For models with a smaller context window, or when the conversation history is large, we can instead use another LLM to summarize the conversations that happened thus far.
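A rough sketch of both approaches, assuming a hypothetical `call_llm(messages)` helper that stands in for whatever chat API you use:

```python
# Hypothetical stand-in for a chat API: takes a list of
# {"role": ..., "content": ...} messages, returns the model's reply.
def call_llm(messages: list[dict]) -> str:
    ...

MAX_TURNS = 20          # crude proxy for the context window budget
history: list[dict] = []

def chat(user_message: str) -> str:
    history.append({"role": "user", "content": user_message})

    # Short-term memory: replay the full history on every call,
    # so the model is "told" what the conversation was.
    reply = call_llm(history)
    history.append({"role": "assistant", "content": reply})

    # When the history no longer fits, compress it into a summary
    # produced by another LLM call.
    if len(history) > MAX_TURNS:
        transcript = "\n".join(m["content"] for m in history)
        summary = call_llm([{"role": "user",
                             "content": f"Summarize this conversation:\n{transcript}"}])
        history[:] = [{"role": "system",
                       "content": f"Summary of the conversation so far: {summary}"}]
    return reply
```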
Long-term memory in LLM Agents includes the agent's past action space that needs to be retained over an extended period. A common technique to enable long-term memory is to store all previous interactions, actions, and conversations in an external vector database.
This method is often referred to as Retrieval-Augmented Generation (RAG).
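A minimal sketch of that idea, with a hypothetical `embed` function standing in for a real embedding model and a plain list standing in for the vector database:

```python
import numpy as np

# Hypothetical stand-in for an embedding model: text -> vector.
def embed(text: str) -> np.ndarray:
    ...

memory_texts: list[str] = []           # past interactions and actions
memory_vectors: list[np.ndarray] = []  # their embeddings

def remember(text: str) -> None:
    """Store an interaction in the 'vector database'."""
    memory_texts.append(text)
    memory_vectors.append(embed(text))

def recall(query: str, k: int = 3) -> list[str]:
    """Retrieve the k most similar past interactions (the R in RAG)."""
    q = embed(query)
    scores = [float(np.dot(q, v) / (np.linalg.norm(q) * np.linalg.norm(v)))
              for v in memory_vectors]
    top = np.argsort(scores)[::-1][:k]
    return [memory_texts[i] for i in top]
```

The retrieved snippets are then prepended to the prompt, so the LLM can condition its answer on them.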
Tools allow a given LLM to either interact with an external environment (such as databases) or use external applications (such as custom code to run).
Tools generally have two use cases: fetching data to retrieve up-to-date information and taking action like setting a meeting or ordering food.
To actually use a tool, the LLM has to generate text that fits the API of the given tool. We typically expect strings that can be formatted as JSON so that they can easily be fed to a code interpreter.
You can also generate custom functions that the LLM can use, like a basic multiplication function. This is often referred to as function calling.
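A bare-bones illustration of that flow (the JSON format here is illustrative; every provider defines its own schema):

```python
import json

# A custom function the LLM is allowed to call.
def multiply(a: float, b: float) -> float:
    return a * b

TOOLS = {"multiply": multiply}

# Suppose the LLM, given the tool's description, generates a JSON
# string naming the function and its arguments:
llm_output = '{"name": "multiply", "arguments": {"a": 514, "b": 378}}'

call = json.loads(llm_output)
result = TOOLS[call["name"]](**call["arguments"])
print(result)  # 194292 -- fed back to the LLM as the tool's result
```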
Some LLMs can use any tool if they are prompted correctly and extensively; tool use is something that most current LLMs are capable of.
Tools can be used in a fixed order if the agentic framework dictates it, or the LLM can autonomously choose which tool to use and when. LLM Agents are essentially sequences of LLM calls, but with autonomous selection of actions, tools, and so on.
Planning in LLM Agents involves breaking a given task up into actionable steps.
This plan allows the model to iteratively reflect on past behavior and update the current plan if necessary.
Planning actionable steps requires complex reasoning behavior. As such, the LLM must be able to showcase this behavior before taking the next step in planning out the task.
Reasoning LLMs are those that tend to think before answering a question.
This reasoning behavior can be enabled by roughly two choices: fine-tuning the LLM or specific prompt engineering. With prompt engineering, we can create examples of the reasoning process that the LLM should follow.
This methodology of providing examples of thought processes is called Chain-of-Thought and enables more complex reasoning behavior.
Chain-of-thought can also be enabled without any examples (zero-shot prompting) by simply stating "Let's think step-by-step."
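Concretely, the two flavors differ only in the prompt. The worked example below is the classic one from the Chain-of-Thought paper:

```python
# Few-shot Chain-of-Thought: demonstrate the reasoning we want.
few_shot_prompt = """Q: Roger has 5 tennis balls. He buys 2 more cans of
tennis balls. Each can has 3 tennis balls. How many tennis balls does
he have now?
A: Roger started with 5 balls. 2 cans of 3 tennis balls each is
6 tennis balls. 5 + 6 = 11. The answer is 11.

Q: {question}
A:"""

# Zero-shot Chain-of-Thought: no examples, just the trigger phrase.
zero_shot_prompt = "Q: {question}\nA: Let's think step-by-step."
```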
One of the first techniques to combine both reasoning and action processes is called ReAct (Reason and Act).
ReAct does so through careful prompt engineering. The ReAct prompt describes three steps: Thought (reasoning about the current situation), Action (executing a tool), and Observation (reasoning about the action's output).
It continues this behavior until an action specifies to return the result. By iterating over thoughts and observations, the LLM can plan out actions, observe its output, and adjust accordingly.
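A stripped-down version of that loop might look as follows, again assuming a hypothetical `call_llm` helper and a `tools` dictionary, and assuming the model sticks to the `Action: tool[input]` format:

```python
def react_agent(question: str, call_llm, tools: dict, max_steps: int = 5):
    """Minimal ReAct-style loop: Thought -> Action -> Observation."""
    prompt = (
        "Answer the question by interleaving Thought, Action and Observation "
        "steps. Actions look like: Action: tool_name[input]. "
        "Finish with: Action: finish[final answer].\n"
        f"Question: {question}\n"
    )
    for _ in range(max_steps):
        step = call_llm(prompt)  # e.g. "Thought: ...\nAction: search[...]"
        prompt += step + "\n"

        # Parse the action the model chose (assumes the format is followed).
        action = step.split("Action:")[-1].strip()
        name, arg = action.split("[", 1)
        name, arg = name.strip(), arg.rstrip("]")

        if name == "finish":
            return arg  # the model decided to return its result

        # Execute the tool and feed the result back as an Observation.
        observation = tools[name](arg)
        prompt += f"Observation: {observation}\n"
    return None  # ran out of steps without finishing
```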
Nobody, not even LLMs with ReAct, will perform every task perfectly. Failing is part of the process as long as you can reflect on that process.
This process is missing from ReAct and is where Reflexion comes in. Reflexion is a technique that uses verbal reinforcement to help agents learn from prior failures.
Memory modules are added to track actions (short-term) and self-reflections (long-term), helping the Agent learn from its mistakes and identify improved actions.
A similar and elegant technique is called SELF-REFINE, where the actions of generating feedback and refining the output are repeated.
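A sketch of that feedback-refine loop (the stopping criterion below is a simplification for illustration; the paper defines its own):

```python
def self_refine(task: str, call_llm, n_rounds: int = 3) -> str:
    """SELF-REFINE-style loop: generate, get feedback, refine, repeat."""
    output = call_llm(f"Complete this task:\n{task}")
    for _ in range(n_rounds):
        feedback = call_llm(
            f"Task: {task}\nOutput: {output}\n"
            "Give concrete feedback on how to improve this output, "
            "or say 'no further improvements'."
        )
        if "no further improvements" in feedback.lower():
            break  # simplified stopping criterion
        output = call_llm(
            f"Task: {task}\nOutput: {output}\nFeedback: {feedback}\n"
            "Rewrite the output, applying the feedback."
        )
    return output
```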
The single Agent we explored has several issues: too many tools may complicate selection, context becomes too complex, and the task may require specialization.
Instead, we can look towards Multi-Agents: frameworks where multiple agents (each with access to tools, memory, and planning) interact with each other and their environments.
These Multi-Agent systems usually consist of specialized Agents, each equipped with their own toolset and overseen by a supervisor. The supervisor manages communication between Agents and can assign specific tasks to the specialized Agents.
In practice, there are dozens of Multi-Agent architectures, but most share two components at their core: the initialization of the specialized Agents and the communication between them.
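To make the supervisor pattern from above concrete, here is a toy sketch (again with a hypothetical `call_llm` helper; real frameworks add routing rules, retries, and shared state):

```python
class Agent:
    """A specialized Agent: in a real system this would be a full
    Augmented LLM with its own tools, memory, and planning."""
    def __init__(self, role: str, call_llm):
        self.role = role
        self.call_llm = call_llm

    def run(self, task: str) -> str:
        return self.call_llm(f"You are a {self.role}. Task: {task}")

class Supervisor:
    """Routes each task to the most suitable specialized Agent."""
    def __init__(self, agents: dict, call_llm):
        self.agents = agents
        self.call_llm = call_llm

    def dispatch(self, task: str) -> str:
        # Ask the LLM which specialist should handle the task.
        choice = self.call_llm(
            f"Task: {task}\nPick one agent from {list(self.agents)} "
            "and reply with its name only."
        ).strip()
        return self.agents[choice].run(task)  # assumes a valid choice
```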
One of the most influential Multi-Agent papers is "Generative Agents: Interactive Simulacra of Human Behavior". In this paper, they created computational software agents that simulate believable human behavior.
The Memory module is one of the most vital components in this framework. It stores both the planning and reflection behaviors, as well as all events thus far.
Whatever framework you choose for creating Multi-Agent systems, each Agent is generally composed of several ingredients: its profile, perception of the environment, memory, planning, and available actions.
Popular frameworks for implementing these components are AutoGen, MetaGPT, and CAMEL. However, each framework approaches communication between Agents a bit differently.
LLM Agents represent a powerful evolution in AI capabilities, combining the natural language understanding of large language models with the ability to interact with their environment through tools, retain information in memory systems, and plan complex actions.
The field continues to evolve rapidly, with advances in memory systems, tool use, planning and reasoning, and Multi-Agent collaboration.
As these technologies continue to mature, we can expect LLM Agents to take on increasingly complex tasks and provide more sophisticated assistance across a wide range of domains.
"This concludes our journey of LLM Agents! Hopefully, this guide gives a better understanding of how LLM Agents are built."
— Maarten Grootendorst