In the video above, we demonstrate the learning process of one of our LLM Agents being taught how to use an API to control a quadcopter drone. In the initial stages, we must provide the Agent with detailed and comprehensive instructions about how to send HTTP requests and what commands are available through the API. As the video progresses, the Agent quickly grasps these instructions and leverages the knowledge it already has to perform advanced and intricate tasks, such as flying the drone along a square trajectory. This showcases the Agent's resilience and adaptive learning capabilities: the Agent recovers from errors and false assumptions.
This version of the continual-learning agent represents a significant advancement over our first prototype (an agent embodied in a Python terminal). This enhanced Agent has access to distinct forms of working memory and long-term memory, enabling it to effectively manage several types of memory inconsistencies, such as contradictions or outdated information, and to learn from user feedback and environmental cues. You can think of it as a cognitive architecture built on top of an LLM.
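As a rough illustration, the two memory tiers and the handling of outdated information could be sketched as follows. All names here (AgentMemory, remember, recall) are hypothetical stand-ins, not our actual implementation:

```python
import time
from typing import Optional

class AgentMemory:
    """Toy sketch: transient working memory plus long-term memory keyed
    by topic, so newer facts supersede contradictory or outdated ones."""
    def __init__(self):
        self.working = {}    # state for the current task only
        self.long_term = {}  # topic -> (timestamp, fact)

    def remember(self, topic: str, fact: str) -> None:
        # Writing under an existing topic resolves the inconsistency
        # in favour of the most recent fact.
        self.long_term[topic] = (time.time(), fact)

    def recall(self, topic: str) -> Optional[str]:
        entry = self.long_term.get(topic)
        return entry[1] if entry else None

memory = AgentMemory()
memory.remember("takeoff", "use the asynchronous takeoff command")
memory.remember("takeoff", "prefer the blocking takeoff command")  # corrects the old fact
print(memory.recall("takeoff"))  # -> "prefer the blocking takeoff command"
```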
The agent's response is the result of a sequence of nested steps. This method augments the LLM's cognitive resources and attention span, extending beyond the limits of the LLM itself (LLMs are stateless, have a fixed-size context, don't pay sufficient attention to all instructions in the prompt, etc.). Notably, the process employs iterative prompting, using multiple prompts to accomplish tasks that the LLM can't perform in a single inference, such as retrieving and summarizing memories or maintaining the agent's state for future steps. Every step in this sequence receives input data, processes it to produce relevant output data, and shares this output across the entire chain. Consequently, formulating a response becomes a joint effort. Each step in the chain has the autonomy to determine the information it requires, what new state needs to be stored in working memory and long-term memory, whether it needs to consult the LLM, and what specific prompt should be fed to the LLM on each iteration.
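Here is a minimal sketch of this chain-of-steps idea, assuming a shared context dictionary that flows through the pipeline. All names (AgentStep, RetrieveMemories, GenerateResponse) and the FakeLLM stub are illustrative assumptions, not our actual code:

```python
from typing import Any, Dict, List

class FakeLLM:
    """Stand-in for a real LLM client so the sketch runs end to end."""
    def complete(self, prompt: str) -> str:
        return f"<llm-output for prompt of {len(prompt)} chars>"

class AgentStep:
    """One stage of the pipeline: reads and writes a shared context."""
    def __init__(self, llm: FakeLLM):
        self.llm = llm
    def run(self, context: Dict[str, Any]) -> Dict[str, Any]:
        raise NotImplementedError

class RetrieveMemories(AgentStep):
    def run(self, context):
        # Each step decides on its own what information it needs; here we
        # just keep stored memories that share words with the user message.
        words = set(context["user_message"].lower().split())
        context["memories"] = [m for m in context["long_term_memory"]
                               if words & set(m.lower().split())]
        return context

class GenerateResponse(AgentStep):
    def run(self, context):
        # Each step builds its own prompt: iterative prompting means several
        # LLM calls cooperate to produce one agent response.
        prompt = (f"Relevant memories: {context['memories']}\n"
                  f"User: {context['user_message']}")
        context["response"] = self.llm.complete(prompt)
        return context

def run_pipeline(steps: List[AgentStep], context: Dict[str, Any]) -> Dict[str, Any]:
    # Every step's output is shared with the rest of the chain.
    for step in steps:
        context = step.run(context)
    return context

llm = FakeLLM()
context = {"user_message": "take off and hover",
           "long_term_memory": ["the drone API listens on a local URL",
                                "take off with a blocking command first"]}
result = run_pipeline([RetrieveMemories(llm), GenerateResponse(llm)], context)
print(result["response"])
```

The key design choice is that no single LLM call produces the answer; each stage issues its own, narrower prompt and contributes its output to the shared context.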
Initially, the agent is in a “blank state,” knowing nothing about the drone API. Therefore, we begin by closely guiding its actions, telling the agent the exact URL it has to communicate with and the exact form and content of the HTTP requests it has to send. However, as time progresses, the Agent gradually accumulates knowledge and expertise, enabling it to tackle increasingly complex tasks based on its past experience.
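For instance, an early instruction might boil down to the agent executing something like the following in its terminal. The URL, endpoint, and payload here are hypothetical placeholders for the ones we actually dictate:

```python
import requests  # the agent runs code like this inside its Python terminal

# Hypothetical drone API address and command schema, stated explicitly
# during the early teaching phase.
BASE_URL = "http://localhost:8000"

response = requests.post(f"{BASE_URL}/command",
                         json={"command": "takeoff", "altitude": 10})
response.raise_for_status()  # any API error surfaces immediately in the terminal
print(response.json())
```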
We actively incentivize the agent to create functions, or any kind of reusable code, to represent learned skills, something the agent ends up doing without us having to tell it to. In addition, the agent interacts with a Python terminal as its means of performing actions, so it instantly becomes aware of any malfunction or error in its code, fixing it on the fly and learning from its mistakes. Finally, thanks to its episodic memory, the agent can relate current situations to past ones, remembering errors made during its instruction or functions it wrote long ago that it can now leverage.
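A typical reusable skill might look like the sketch below: a helper the agent writes once, keeps in its session, and composes into higher-level behaviours such as the square trajectory from the first video. Again, send_command and the endpoint are hypothetical:

```python
import requests

BASE_URL = "http://localhost:8000"  # same hypothetical API as above

def send_command(command: str, **params) -> dict:
    """Generic helper the agent can write once and reuse across tasks."""
    r = requests.post(f"{BASE_URL}/command", json={"command": command, **params})
    r.raise_for_status()
    return r.json()

def fly_square(side_m: float = 20.0) -> None:
    """A learned skill: four forward legs separated by 90-degree turns."""
    for _ in range(4):
        send_command("move_forward", distance=side_m)  # blocking call
        send_command("rotate", degrees=90)
```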
The video shows that the terminal prototype ended up becoming one stage of the pipeline (the TerminalAction), which leverages the Python terminal environment plus the memories retrieved by a previous stage to carry out the agent's actions. On top of this, other stages add functionality such as terminal session persistence and the generation of new memories.
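Reusing the AgentStep interface and FakeLLM stub from the pipeline sketch above, a TerminalAction-like stage could be sketched as follows. The real implementation certainly differs; the essence is a persistent exec namespace plus captured output and errors:

```python
import io
import contextlib

class TerminalAction(AgentStep):
    """Sketch of a stage that runs LLM-written code in a persistent session."""
    def __init__(self, llm):
        super().__init__(llm)
        self.session = {}  # persisted namespace: variables and functions survive turns

    def run(self, context):
        code = self.llm.complete(f"Write Python for: {context['user_message']}")
        buffer = io.StringIO()
        try:
            with contextlib.redirect_stdout(buffer):
                exec(code, self.session)  # malfunctions surface immediately...
            context["terminal_output"] = buffer.getvalue()
        except Exception as exc:
            # ...so the traceback can be fed back to the agent to fix on the fly.
            context["terminal_output"] = f"{type(exc).__name__}: {exc}"
        return context
```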
Take a look at more videos below.
In this recording, we teach the Agent to fly the drone in a circle. Because the Agent is still unaware of some aspects of the drone API, we must be specific about certain things, such as using the blocking commands instead of the asynchronous ones. The Agent flies the drone according to the user's specification: the circle has a 50-meter radius, and its center is the current drone position. We have sped up the video 4x for visualisation purposes.
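Under the hood, a circular flight of this kind reduces to visiting waypoints around the center. Here is a minimal sketch using a flat-earth approximation to convert meters into GPS degrees; the goto command is a hypothetical example, not the actual API:

```python
import math

def circle_waypoints(center_lat: float, center_lon: float,
                     radius_m: float = 50.0, n_points: int = 36) -> list:
    """Approximate a circle of GPS waypoints around a center position."""
    deg_per_m_lat = 1.0 / 111_320.0  # degrees of latitude per meter
    deg_per_m_lon = deg_per_m_lat / math.cos(math.radians(center_lat))
    waypoints = []
    for i in range(n_points):
        angle = 2 * math.pi * i / n_points
        lat = center_lat + radius_m * math.sin(angle) * deg_per_m_lat
        lon = center_lon + radius_m * math.cos(angle) * deg_per_m_lon
        waypoints.append((lat, lon))
    return waypoints

# The agent would then visit each waypoint with a blocking command, e.g.:
# for lat, lon in circle_waypoints(drone_lat, drone_lon, radius_m=50.0):
#     send_command("goto", lat=lat, lon=lon)  # blocking, as instructed
```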
This footage demonstrates how we teach the agent to comb an area bounded by four GPS points defined by the user. First, we let the agent memorize the four GPS points by manually sending the drone to each position and asking the agent to remember them under different names. Then, we explain to the agent what we expect it to do: thoroughly fly over that area by following zig-zag corridors 10 meters wide. This is extremely useful for planning exhaustive search operations.
The video has been sped up 4x during the teaching phase and 16x during the combing phase.
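The combing behaviour itself is a classic lawnmower pattern. Here is a minimal sketch of how such waypoints could be generated from the four memorized corners, assuming a roughly rectangular area and the same flat-earth approximation as before; this is an illustration, not the agent's actual code:

```python
import math

def gps_distance_m(p: tuple, q: tuple) -> float:
    """Approximate flat-earth distance in meters between (lat, lon) points."""
    lat_m = (q[0] - p[0]) * 111_320.0
    lon_m = (q[1] - p[1]) * 111_320.0 * math.cos(math.radians(p[0]))
    return math.hypot(lat_m, lon_m)

def lerp(p: tuple, q: tuple, t: float) -> tuple:
    """Linear interpolation between two (lat, lon) points."""
    return (p[0] + (q[0] - p[0]) * t, p[1] + (q[1] - p[1]) * t)

def comb_waypoints(a: tuple, b: tuple, c: tuple, d: tuple,
                   corridor_m: float = 10.0) -> list:
    """Zig-zag coverage: fly corridors between edge a-b and edge d-c,
    stepping corridor_m along the a-d (and b-c) sides between passes."""
    n_corridors = max(1, round(gps_distance_m(a, d) / corridor_m))
    waypoints = []
    for i in range(n_corridors + 1):
        t = i / n_corridors
        start, end = lerp(a, d, t), lerp(b, c, t)
        # Alternate direction on each pass so the path zig-zags.
        waypoints += [start, end] if i % 2 == 0 else [end, start]
    return waypoints
```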
Here's a video showcasing speech-to-text as input and text-to-speech as output. The agent has already been given a list of functions that it can use to interact with the drone API. This addition enhances the user interaction, adding a social dimension to the agent and making the interaction more engaging.
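A voice front-end like this can be prototyped with off-the-shelf libraries. The sketch below uses SpeechRecognition and pyttsx3 as one possible choice, which may differ from what we actually use; run_agent is a hypothetical hook into the pipeline:

```python
import speech_recognition as sr  # pip install SpeechRecognition
import pyttsx3                   # pip install pyttsx3

recognizer = sr.Recognizer()
tts = pyttsx3.init()

def listen() -> str:
    """Capture one utterance from the microphone and transcribe it."""
    with sr.Microphone() as source:
        recognizer.adjust_for_ambient_noise(source)
        audio = recognizer.listen(source)
    return recognizer.recognize_google(audio)

def speak(text: str) -> None:
    """Read the agent's textual response aloud."""
    tts.say(text)
    tts.runAndWait()

# Hypothetical glue: each transcribed utterance goes through the agent
# pipeline, and the agent's reply is spoken back.
# while True:
#     speak(run_agent(listen()))
```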
Note this is still an early prototype, and many improvements are in the works.
Link to the YouTube playlist containing all videos.