Badger Seminar Summer 2021 Summary: Beyond Life-long Learning via Modular Meta-Learning

August 17, 2021

By GoodAI team
26 – 30 July 2021

We recently held the Badger Seminar titled: Beyond Life-long Learning via Modular Meta-Learning, with participants joining online and at our headquarters in Prague. The aim of the five-day seminar was to bring together the GoodAI Research team, our GoodAI Grants recipients, and other experts in the field in order to advance the research in life-long learning, meta-learning, multi-agent learning and other directions that we believe are relevant for building generally intelligent agents. 

Our regular Badger Seminars borrow their name from GoodAI’s Badger architecture. The desired outcome of our research is the creation of a lifelong learning system that is able to gradually accumulate knowledge and effectively re-use such knowledge for the learning of new skills. Such a system should be able to continually adapt and learn to solve a growing, open-ended range of new and unseen tasks and operate in environments of increasing complexity. 

At this seminar we discussed possible pathways to designing such a system through careful meta-learning of a distributed modular learning system coupled with the appropriate minimum viable environment/dataset, cultivating the necessary inductive biases to afford the discovery of a lifelong learner with such properties.

We spent five days, each with a dedicated theme, discussing in small groups. Below you can find some of the key takeaways and collated notes from each day.  

Participants who were able to make it in person joined us at the GoodAI headquarters, the Oranžérie.


Miguel Aguilera, Ferran Alet, Pasquali Antoine, Kai Arulkumaran, Paul Bertens, Martin Biehl, Blake Camp, Michele Campolo, Wendelin Böhmer, Christopher Buckley, Matt Crosby, Aaron Dharna, Sam Earle, Rolando Estrada, Kevin Franz, Roberto Gallotta, David Herel, Miklos Kepes, Mahdi Khosravy, Samuel Kiegeland, Tomas Mikolov, Deepak Pathak, Will Redman, Mark Sandler, Lisa Soros, Julian Togelius, Nathaniel Virgo, Max Vladymyrov, Olaf Witkowski, Pauching Yap, Dominika Zogatova.

GoodAI team: Olga Afanasjeva, Simon Andersson, David Castillo Bolado, Joseph Davidson, Jan Feyereisl, Nicholas Guttenberg, Petr Hlubuček, Martin Poliak, Isabeau Premont-Schwarz, Marek Rosa, Petr Simanek, Jaroslav Vitku.

Participants not able to join us in person joined online.

Discussion Summaries

Day 1: Learning to learn; Lifelong learning (continual learning and gradual learning)


Continual learning has been an outstanding unsolved problem in AI for several decades now. Recently, plenty of works on continual learning for deep learning models have been published. The majority of research focuses on not forgetting, but continual learning is a broader topic, whose mastery may unlock human-level learning in machines. Multiple times in the past people have pointed to the ability to continuously acquire new knowledge and improve this ability on the fly as key towards achieving AGI (A Roadmap towards Machine Intelligence, General AI Challenge – Round One: Gradual Learning, BADGER: Learning to (Learn [Learning Algorithms] through Multi-Agent Communication), and Building Machines That Learn and Think Like People, to name a few).

We believe that, although fruitful, the current trend in building bigger and bigger models that rely on huge datasets does not in the end lead to systems that are universal and adaptive on the level where AGI will be tested. We invited the participants to a joint discussion on continual learning and encouraged them to come forward with their thoughts, fears, motivations, challenges and questions that they believe are important for the topic and can help it advance further.

The top questions asked:

  • Team A: How do we make systems able to recombine skills to solve new tasks on the fly?
  • Team B: How do we achieve continual learning with no task boundaries?
  • Team C: How to meta-learn a replay buffer that helps generalization / robustness?
  • Team D: Could MCTS be used as a task invention framework?
  • Team F: How to provide environments with enough complexity to make life-long learning worthwhile?
  • Team J: Can we find a mechanism allowing an agent to plan its actions, while maximizing its learning?

Summary of outcomes

Team A

Question: Humans are capable of on-the-fly adaptation – we learn many simple tasks quickly by combining previous skills, much more than we specialize in a specific area. How do we make a system that is able to recombine skills to solve new tasks on the fly?

Many current AI methods try to achieve SOTA performance on a limited and well-defined task, resembling Olympic athletes focusing 20+ years on specific tasks. But that is unlike the usual human experience. Every day, we learn something new, some small bits not necessarily related to our expertise, and we need to adapt to previously unseen variations of everyday tasks.

Hypothesis: A modular system with pre-trained abstractions will achieve a faster convergence to an acceptable level of performance compared to a monolithic system.

Such a system, given a new task, needs to just find the new sequence / transition probabilities (e.g. in a Markov process) of modules that work for the particular task – learning new tasks thus requires only a few extra bits and even the search space is small. Storing of the newly learned solution thus comes cheap and its retrieval is also easier. In line with the statement of the question, a minimum level of competence should be set, like a “minimal criterion”. Sequences might show certain variability or context dependency.
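As a rough illustration of the idea above, the following sketch treats learning a new task as a search over short sequences of pre-trained modules, accepting the first sequence that meets a minimal criterion. The module library (`inc`, `double`, `negate`), the function names and the accuracy threshold are all hypothetical choices made for this example, not something proposed at the seminar.

```python
from itertools import product

# Hypothetical library of pre-trained "modules" (toy integer functions).
MODULES = {
    "inc": lambda x: x + 1,
    "double": lambda x: 2 * x,
    "negate": lambda x: -x,
}


def run_sequence(names, x):
    """Apply the named modules to x, one after another."""
    for name in names:
        x = MODULES[name](x)
    return x


def find_sequence(examples, max_len=3, min_accuracy=1.0):
    """Return the shortest module sequence meeting the minimal criterion.

    `examples` is a list of (input, expected_output) pairs. Only the
    sequence of module names has to be stored for the new task.
    """
    for length in range(1, max_len + 1):
        for names in product(MODULES, repeat=length):
            hits = sum(run_sequence(names, x) == y for x, y in examples)
            if hits / len(examples) >= min_accuracy:
                return names
    return None


# New task: y = 2 * (x + 1), described only by a few examples.
task = [(0, 2), (1, 4), (2, 6), (3, 8)]
print(find_sequence(task))
```

The stored solution is just a tuple of module names, which is the sense in which a new task costs only a few extra bits. The exhaustive search also makes the combinatorial growth with sequence length easy to see.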

The idea above implies the need to tackle combinatorial complexity tied to the number of possible combinations of modules. One possible solution is to introduce a hierarchy of abstractions that allow the thinking about the solution of a problem to be simplified, broken down into smaller pieces, at the cost of decreasing the theoretically reachable performance of the solution. To give examples of where abstractions are necessary – an untrained human would be in a situation without required abstractions if you placed them in the pilot’s seat in the cockpit of an airplane, or a child behind the wheel of a car.

As for the tasks or experiments that should verify the hypothesis above, the general consensus was that robotic tasks with shared task features like e.g. MetaWorld could work well for extracting modularity.

Unresolved challenges: how to train the abstractions? Finding the connections between modules that are necessary, and building an abstraction, could be achieved by training a modular architecture on several different instances of the same task and taking the union (or possibly, intersection) of the learned topologies between the modules.

Team B

Question: How do we perform continual learning where there is no clear task partition, and instead just a continually varying context that does not immediately inform about the task?

In most continual learning settings, some information about the underlying tasks is provided to the agent, but in the natural world such information is not always available, or is not available at all. When we lack information about what tasks we are meant to solve, many issues start emerging, such as the difficulty with assigning some form of discrete labels to policies or structures that are being learned.

Summary of discussion:

  • What is a task? How do we define it and then reason about it, both as a discrete set and as a continuous spectrum? There are various ways of identifying a task, from labels and rewards to imitation and more, each carrying a different type of information about the task. There are settings in which there are continua of tasks, such as hockey or navigation. There are also spectra of learning that range from unsupervised through reward-based and semi-supervised all the way to supervised learning, and this can also serve as task information. The problem is that when the task is continuous, it makes no sense to associate the knowledge with the task – imagine a goal of reaching GPS coordinates.
  • Task representation can be categorized as intrinsic or extrinsic
    • intrinsic – continual learning is very hard here
    • extrinsic – not really a realistic continual learning setting
  • Progression of life can be seen as an interesting example of the evolution of task information that is provided to the agent, where when we are born, learning is unsupervised and limited task information gets through. As we grow, we learn more and more inductive biases and the amount and complexity of task information increases.
  • What is the simplest scenario that one can think of that represents this setup?
  • The system we design must be built with a ‘universal search capability’ that does not break down through future learning. The universal search capability can be inefficient, but it is fixed and always present in the system in case the learned learning algorithm fails. A simple example: a system is made up of two parts. “A” is the fixed universal search. “B” is a task-specific part. A trains B for any new task. When the system learns to solve task T1, only B gets rewritten; A keeps its original capabilities. Then the system learns task T2, so B gets rewritten again, but A stays ready to learn any new task. We can say that this system has the capacity to adapt to any new task, without worrying about over-specialization.
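The two-part example in the last point can be sketched as follows. This is only a toy under stated assumptions: the "universal search" is an exhaustive search over a tiny, hypothetical program space (linear functions with small integer coefficients), and the names `SEARCH_SPACE`, `universal_search` and `System` are invented for the illustration.

```python
from itertools import product

# Part A: a fixed, inefficient but general search. It is never modified.
# Hypothetical program space: f(x) = a * x + b with a, b in -3..3.
SEARCH_SPACE = tuple(product(range(-3, 4), repeat=2))


def universal_search(examples):
    """Return the first (a, b) that fits every (x, y) example, or None."""
    for a, b in SEARCH_SPACE:
        if all(a * x + b == y for x, y in examples):
            return (a, b)
    return None


class System:
    """Part B (task_specific) is rewritten per task; part A stays fixed."""

    def __init__(self):
        self.task_specific = None

    def learn(self, examples):
        # A trains B: only the task-specific part changes.
        self.task_specific = universal_search(examples)
        return self.task_specific

    def act(self, x):
        a, b = self.task_specific
        return a * x + b


system = System()
system.learn([(0, 1), (1, 3)])      # task T1
print(system.act(5))
system.learn([(0, -2), (2, -4)])    # task T2 overwrites B, A is untouched
print(system.act(3))
```

Because the search itself is never overwritten, learning T1 cannot reduce the system's ability to learn T2. The price is that every new task pays the full cost of the search.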

Hypothesis: No hypothesis was proposed.

Team C

Question: How to meta-learn a replay buffer that helps generalization / robustness?

The following assumption underlies much of the discussion: “We should focus on a good representation, suitable for learning new knowledge (identifying new concepts, instead of relying on hand-crafted tasks), knowledge consolidation (feature shift), and combinatorial generalization (concept reuse).” Are these mechanisms enough to achieve better generalization properties?

The following ideas were proposed as mechanisms for meta-learning of the replay buffer:

  • optimizing for memory loss in a network of deep artificial neurons (DANs)
  • use a noisy past experience to improve generalization
  • add more latent variables on the fly, to support continual learning

The following works were mentioned as inspirational for solving the problem: (although they use replay), (fast adaptation without replay), (Rapid Motor Adaptation, fast adaptation without replay again)

An obvious drawback of meta-learning a technique like the replay buffer is the requirement on computation and / or number of tasks – Luke Metz’s optimization on a thousand tasks was recalled in this context. Then again, representation learning would perhaps be better trained in an unsupervised or self-supervised way, to improve the strength of the feedback signal.

Hypothesis: No hypothesis was proposed.

Team D 

Assumption: Interesting tasks should lead to other interesting tasks.

Question: Could MCTS be used as a task invention framework? (It simulates what would happen if a task is given to a learner)

The following scheme was proposed as a task invention framework that proposes tasks most suitable for fast (learning) skill acquisition by an agent. The scheme is however costly to run.

  • A: An agent that has been trained on some tasks so far.
  • D: A distribution of anchor tasks
  • 1. Propose a new task T
  • 2. See how many tasks in D the agent A can learn to improve on in a few gradient descent steps.
  • 3. Train on T
  • 4. See how many tasks in D the agent can now learn to improve on in a few gradient descent steps.
  • 5. The difference is how good T is for the curriculum of that agent A
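The five steps above can be sketched on a toy problem. Everything concrete here is an assumption made for illustration: tasks are noise-free linear regressions, the agent is a weight vector, "can learn to improve on" is read as "reaches a loss below a threshold after three gradient steps", and the helper names are hypothetical.

```python
import numpy as np

# Shared inputs chosen so that X.T @ X / n is the identity (keeps the toy exact).
X = np.sqrt(2.0) * np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.0, -1.0]])


def make_task(target_w):
    """A task is (inputs, targets) for a linear map with weights target_w."""
    return X, X @ np.asarray(target_w, dtype=float)


def loss(w, task):
    inputs, targets = task
    return float(np.mean((inputs @ w - targets) ** 2))


def train(w, task, steps, lr=0.1):
    """Plain gradient descent on the mean squared error of one task."""
    inputs, targets = task
    w = np.array(w, dtype=float)
    for _ in range(steps):
        grad = 2.0 * inputs.T @ (inputs @ w - targets) / len(targets)
        w = w - lr * grad
    return w


def count_learnable(w, anchors, few_steps=3, threshold=0.1):
    """Steps 2 and 4: anchor tasks solvable after a few gradient steps."""
    return sum(loss(train(w, task, few_steps), task) < threshold for task in anchors)


def task_value(w, candidate, anchors, train_steps=100):
    """Step 5: change in the number of learnable anchors after training on T."""
    before = count_learnable(w, anchors)
    after = count_learnable(train(w, candidate, train_steps), anchors)
    return after - before


offsets = [(0.0, 0.0), (0.2, 0.0), (0.0, 0.2), (-0.2, 0.0), (0.0, -0.2)]
anchors = [make_task([2.0 + dx, -1.0 + dy]) for dx, dy in offsets]   # D
agent = np.zeros(2)                                                  # A
print(task_value(agent, make_task([2.0, -1.0]), anchors))   # T close to D
print(task_value(agent, make_task([-2.0, 1.0]), anchors))   # T far from D
```

Even in this toy, scoring one candidate task requires training on it and then adapting to every anchor task twice, which illustrates why the scheme is costly to run.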

This scheme can be viewed as “empowering” the agent. The teacher chooses a task that allows the agent to be trained to do as many tasks in the future as possible.

A “latent abilities” algorithm was also suggested for detecting what agents know how to do and what is necessary to perform well in tasks.

Hypothesis: No hypothesis was proposed.

Team F 

Question: How can we provide environments with enough complexity to make life-long learning worthwhile? If environments themselves are evolving (or involve multi-agent tasks depending on the behavior of other agents), how should we measure agent performance?

Hypothesis: One can imagine three agent learning paradigms:

  1. An agent learning on fixed, supervised tasks, and to survive in an open-ended world
  2. An agent learning on supervised tasks, and to survive in an open-ended world with a curiosity-based reward, and
  3. An agent learning on supervised tasks, in an open-ended world with curiosity, and on a set of tasks procedurally generated by another learning agent.

One can then expect the agents in paradigms 2 and 3 to exhibit higher performance on a held-out set of hand-designed tasks. One can also expect the task-generating agent in paradigm 3 to generate tasks that are complex and diverse with respect to the initial set of supervised tasks.

Experiment: In this discussion, a hypothesis and an associated experiment were proposed that should be able to show that an agent trained via a curiosity-based learning algorithm is better suited for future tasks than an agent that maximizes an explicit objective. In this experiment, it is important to note that the environment is procedurally generated, open, and with additional constraints and pressures, with a crafting tree available. An important property of an open-ended environment was identified – to allow for the creation of tools that in turn may be used to modify and build new things in the environment.

Team J

Question: Can we find a mechanism to allow an agent to plan its actions, in a way that maximizes its learning?

When we humans decide to learn something (active learning), e.g. riding a bike, we already know several things which condition the learning process, such as whether the task is doable, some expectations about our skills and abilities, and what skills and abilities might be relevant for the new task. We then plan our actions to enable learning of the new skill. There are different methods and approaches that can be used to model similar processes and planning steps in our artificial agents, including MCTS, multi-agent competition, collaboration, self-play, and meta-reinforcement learning.

Hypothesis: There is a method that can maximize the learning speed of an agent by planning its actions and initial states, so that it acquires skills that favour its future learning.

Summary: Learning a skill is something that can be planned with some degree of knowledge and introspection. Selecting the right initial states and actions for quickly training a policy is a task that might be done by a 2nd level policy (meta-learned).

Ending word

Continual learning is a difficult topic that has many open questions and is challenged by limited clarity in definitions of scenarios and settings that are truly representative of how this type of learning occurs in the natural world. In order to progress, we need to be clear about our definitions, scenarios and tasks and how these relate to practical situations, beyond the methods we build that solve issues such as catastrophic forgetting as well as forward and backward transfer. In the practical sense, where possible paths towards general intelligence are considered, the most useful aspect of continual learning seems to be the ability to improve one’s own learning process through knowledge acquired through the learning process itself.

Day 2: Benefits of Modularity; Collective & Social Learning


Nearly all natural systems exhibit learning at multiple levels. Unlike in most current artificial intelligence systems where a global objective exists, natural systems often involve interactions between multiple agents or modules without such a system-wide objective, yet are capable of collective global behaviour. Examples of such systems in nature range from self-organization at the cellular level, all the way to collectives of bio-organisms or even human societies. In all of those systems, learning occurs via local interactions. The majority of existing AI systems, however, still focus on optimizing a fixed hand-designed loss function, rather than exploiting the power of collective learning, despite their frequent origins in biological inspiration.

Understanding the foundational differences between collective and monolithic systems, their drawbacks and advantages and when one system should be preferred over another is fundamental to understanding whether collective learning systems could unlock potential that existing monolithic systems are missing.

The top questions asked:

  • Team A: How can a complicated task be broken down into smaller tasks that are easier to solve than the one big task? Is there a differentiable way to do this?
  • Team B: Is there a type of learning that can emerge on a society level and wouldn’t be possible in a monolithic system?
  • Team C: How does the efficacy of a collection of modules or experts differ as the heterogeneity and granularity of those modules change?
  • Team D: How can we effectively search for a routing of modules while avoiding combinatorial explosion? The assumption was that during training the routing changes much faster than the modules themselves, which can be static or slowly changing.
  • Team F: Is there a general method as to how modules can be integrated together when they are on the spectrum between discrete and symbolic computation?
  • Team J: What is the minimal set of inductive biases needed for obtaining social learning and what is the difference between social learning at the agent and deep network levels?

Summary of outcomes

Team A: Breaking down a task to submodules

Question: How can a complicated task be broken down into smaller tasks that are easier to solve than the one big task? Is there a differentiable way to do this?

The following approaches were discussed:

  • Curriculum – If we phrase our problem as a curriculum, we can create modules by moving along the curriculum. The incremental differences between tasks relate to either incremental improvements of the modules or to the necessity of creating a new module.
  • Module reuse in multiple tasks – If we operate over a distribution of tasks, we can define modules as an efficient way of exploring to improve over those tasks. The modules are reused in multiple tasks; for a well-defined module, we expect improvements to the module to benefit all or a large majority of the tasks it appears in.

An experiment was proposed – to mitigate the modules being stuck in a bad Nash equilibrium, they can be forced to be able to cooperate with different modules. Modules that would not be able to adapt would be re-trained or eventually removed from the archive of reusable modules.

Team B: Emergent Learning at the Social Level

Question: Is there a type of learning that can emerge on a society level and wouldn’t be possible in a monolithic system? 

Key points of social learning:

  • External information storage – Is it the key for better collective learning? The storage can be cumulative and bigger than the memory of an individual agent.
  • Multiple feedback mechanisms – A social system can have many adaptive feedback mechanisms that will scale better than a centralized feedback of monolithic systems.
  • Efficiency threshold – Is there a threshold at which social systems become more efficient than monolithic systems?

Identified benefits of social systems:

  • Better scaling – modular / hierarchical systems with mostly local communication scale better than monolithic systems
  • Replication of skills – a discovered skill can be replicated to another part of the society, whereas in a monolithic system it needs to be rediscovered
  • Open-ended learning – because there is no single fixed feedback mechanism and learning can diverge, social systems are more suitable for open-ended learning

Team C: Heterogeneity and Granularity of experts and behaviors

Question: How does the efficacy of a collection of modules or experts differ as the heterogeneity and granularity of those modules change?

Two types of heterogeneity and granularity were identified:

  • Heterogeneity in the policies of modules (weights, number of parameters, etc) / Heterogeneity in the internal behaviors of the modules,
  • Granularity in the sizes of modules (fewer -> larger number of parameters) / Granularity as the number of unique policies that can be adopted by experts (Note this is similar to the heterogeneity definition above).

A number of experiments were proposed:

  • In a homogeneous module scheme, the robustness of the communication is an issue. Most SOTA methods use manual data augmentation to obtain diversity in the behaviors and therefore diversity in the gradients. Can we perturb the internal activations of the modules instead? We would be able to set the variance of the perturbations and explore the landscape of behaviors as the divergence of behaviors varies.
  • Another experiment is to vary the number of policies that experts can acquire. We expect that increasing this number will require more noise in the communication, as there are more total parameters and the risk of overfitting is greater.
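A minimal sketch of the first experiment, assuming a toy setting: a set of homogeneous modules shares one small two-layer network, each module receives the same input, and Gaussian noise with standard deviation `sigma` is added to the hidden activations. The network sizes, the function names and the divergence measure (variance of outputs across modules) are assumptions made for this example.

```python
import numpy as np


def module_outputs(x, w_in, w_out, sigma, n_modules=8, seed=0):
    """Outputs of identical modules whose hidden activations are perturbed."""
    rng = np.random.default_rng(seed)
    hidden = np.tanh(w_in @ x)                        # same for every module
    noise = rng.standard_normal((n_modules, hidden.size))
    perturbed = hidden + sigma * noise                # per-module perturbation
    return perturbed @ w_out.T                        # (n_modules, n_outputs)


def behavioral_divergence(outputs):
    """Variance of the outputs across modules, averaged over output units."""
    return float(outputs.var(axis=0).mean())


rng = np.random.default_rng(1)
w_in = rng.standard_normal((16, 4))    # shared input weights
w_out = rng.standard_normal((2, 16))   # shared output weights
x = rng.standard_normal(4)

for sigma in (0.0, 0.1, 0.2):
    d = behavioral_divergence(module_outputs(x, w_in, w_out, sigma))
    print(f"sigma={sigma:.1f}  divergence={d:.4f}")
```

With no perturbation the modules behave identically, so there is no diversity for learning to exploit. In this linear read-out toy the divergence grows with the square of `sigma`, which gives a single knob for exploring the landscape of behaviors.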

Team D: Routing of modules

Question: How can we effectively search for a routing of modules while avoiding combinatorial explosion? The assumption was that during training the routing changes much faster than the modules themselves, which can be static or slowly changing.

Four approaches were identified to fight the complexity of the search:

  1. Amortized Inference of slow combinatorial search. Neural proposal function informs search and learns from it a la AlphaZero.
    1. Pointer with program synthesis
    2. Pointer with modular meta-learning
  2. Hierarchy – Let’s have a library of modules that can be wired together and used in an agent. When some modules are wired together in the same way in multiple agents, they are encapsulated as a single compound module (including the inner wiring) and added to the library. 
  3. Informed search – add a label to each module describing its function. Then the task can be solved by decomposing it into sub-tasks until each sub-task can be implemented by a module.
  4. Pruning – Start with fully connected modules and enforce sparsity during the learning until sparse wiring is found.
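As one concrete reading of the pruning approach (item 4), the sketch below starts from a fully connected linear routing over fixed module outputs and repeatedly removes the weakest connection while the fit stays within a tolerance. This is iterative magnitude pruning with refitting, chosen here for simplicity; it is not a method specified in the discussion, and the function name and tolerance are hypothetical.

```python
import numpy as np


def prune_routing(module_outputs, target, tol=1e-6):
    """Start fully connected, drop the weakest link while the error stays low.

    module_outputs: (n_samples, n_modules) outputs of fixed modules.
    target: (n_samples,) desired combined output.
    Returns the indices of the surviving modules and their routing weights.
    """
    active = list(range(module_outputs.shape[1]))

    def fit(cols):
        # Least-squares routing weights restricted to the given modules.
        w, *_ = np.linalg.lstsq(module_outputs[:, cols], target, rcond=None)
        err = float(np.mean((module_outputs[:, cols] @ w - target) ** 2))
        return w, err

    weights, _ = fit(active)
    while len(active) > 1:
        weakest = int(np.argmin(np.abs(weights)))
        candidate = active[:weakest] + active[weakest + 1:]
        new_weights, err = fit(candidate)
        if err > tol:          # removing this link hurts too much: stop
            break
        active, weights = candidate, new_weights
    return active, weights


rng = np.random.default_rng(0)
outputs = rng.standard_normal((200, 6))              # six static modules
true_routing = np.array([1.5, 0.0, 0.0, -2.0, 0.0, 0.7])
target = outputs @ true_routing

active, weights = prune_routing(outputs, target)
print(active, np.round(weights, 3))
```

The modules are held fixed while only the routing is searched, matching the assumption that routing changes much faster than the modules themselves.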

Team F: Integration of modules

Question: Is there a general method as to how modules can be integrated together when they are on the spectrum between discrete and symbolic computation? Large monolithic systems can be trained on very large datasets, but smaller datasets like Chollet’s ARC may need something like modules for effective learning.

But we might also want diversity in the substrate of those modules, for instance, ARC might be better served by a set of discrete rules, whereas MNIST can be more symbolic. The question is therefore how do we combine these modules together?

The experiment is to generate alien programming languages procedurally and use meta-learning to search for modules that can predict programs in these languages. The architectures of the modules will be diverse, ranging from neural networks to integer programming to cellular automata. A successful architecture from this would support the conclusion that a diversity of modules and training methods is capable of processing complex symbolic logic tasks.

Team J: Minimum biases needed for social learning 

Question: What is the minimal set of inductive biases needed for obtaining social learning and what is the difference between social learning at the agent and deep network levels?

Badger experts might have identical policies, but in any case, they need diversity somewhere, whether that would be in policy or input. However, it has been identified that too much communication can be harmful to learning.

Experiments here include adding noise to signals within some specified “noise budget” so that a module can choose how much information it wishes to communicate, and have agent pools with different parameters that have to compete to send and receive information.   

Ending word

The sessions of day 2 focused on modularity and social learning. Many of the chosen topics explored how modules can be learned and connected. This indicates that there is still a lot of uncertainty around the potential mechanisms that could be used to induce modular experts. A number of experiments were proposed for how search could be performed over the module space.

On the social side, implicit biases and emergence were discussed. Experiments that test for emergence would likely have multiple feedback mechanisms and some external memory storage. Minimal biases that have been identified for social learning lie in diversity: without diversity somewhere in the system, there will not be any robust generalization.

These topics are useful for Badger-like architectures in that the canonical Badger system is one of homogeneous modules that communicate with each other. Investigations into how that communication can be performed as a routing task, and how the noise can be leveraged to promote diversity in behavior, will be essential to further understand the implications of Badger.

Day 3: Open-ended exploration and self-invented goals


Day 3 featured the topic of open-endedness and goal creation. Much of what humans have accomplished is not about satisfying base-level urges or pressures, but goes far above and beyond anything that human evolution would have been exposed to. In that vein, we don’t just want artificial systems that do specific things which we tell them to do, but rather we want systems that discover on their own a wide variety of interesting things to pursue, including things we had not thought of or been able to imagine at the time of building the system.

To this end, the concept of open-endedness leads us to desire a variety of things – an AI that won’t saturate or finish learning, but rather learns forever; an AI that can continually surprise us with novelty; an AI that centers its own interests and capabilities and pursues them even in ways we could not anticipate when we first trained or built it. To this end, the question of how agents might determine their own goals in a coherent way is central to open-endedness, as they must be able to move past any finite specific set of guidance we initially give.

The groups came up with numerous questions and ways to think about these problems, which roughly can be grouped into three themes:

  • What is open-endedness? What do we want out of it? How do we recognize whether we have it, or if it’s even possible? And how do we guide it to be useful without destroying it?
  • Does the ability of an agent to introspect and model its own learning process give rise to a natural and open-ended source of direction for it to pursue over the space of goals, and how might that work?
  • What aspects of environment design and interaction are conducive to open-endedness, and how can we make environments where it’s even possible for what an agent discovers to surprise us when we have such a strong bias for thinking in terms of tasks and anticipating solutions? Can environment design be automated to create the potential for things like tool use, or could we even give responsibility for the environment to a society of other agents all supervising each other?

The top questions asked:

  • Team A: How to test whether a system has an open-ended potential?
  • Team B: How useful is it to know what you know? 
  • Team C: How can we create open-ended systems where the agents can surprise us?
  • Team D: How to introduce goals to open-ended learning? 
  • Team F: What makes a desire (goal, motivation, etc) good from an agent perspective? 
  • Team J: Why are self-invented goals not solved yet? 

How can we know Open-Endedness when we see it?

Team A asked whether it would be possible to test a system for its open-ended potential without fully simulating it out: is it possible to prove that some systems cannot be open-ended, while others might have promising ingredients? They identified a number of properties that would exclude open-endedness outright: convergence to a static state, lack of interactions between elements, failure of the phenomena in the system to scale with system size, and a dominance of random behavior over dynamic structure. Similarly, they considered that some things might be indicators of open-endedness: multi-scale patterns, the presence of replicators, and increasing diversity. Team A proposes to investigate heuristics meant to determine whether a system might be open-ended by scaling the system and seeing how the different heuristics behave under scaling.

Team D similarly faced a question of different kinds of open-endedness, but from a point of view of wanting to drive the system with goals. To that end, they considered a form of open-endedness where the goal is a fixed constraint, but the methods of solution of that goal would be diverse and complexifying.

Both teams considered using cellular automata (CAs) as a test-bed for these ideas. Team A proposed to test known 1D Wolfram CAs that had been judged to be Type 3, checking whether patterns emerge when the system is scaled up and therefore whether human intuitive classifications of open-endedness remain trustworthy at scale. Team D also considered CAs, but in the context of searching a rule-space for multiple solutions which all satisfy some constraint (such as producing a particular pattern at least once during a run), and then seeing how the fraction of the rule-space which remained viable could change over the course of exploration.
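To make the scaling idea concrete, here is a minimal sketch of a 1D elementary (Wolfram) cellular automaton run at several sizes, with two very crude heuristics: whether the run has converged to a static state, and how many distinct rows it visits. The choice of rule 30 (a rule usually placed in Wolfram's chaotic class), the periodic boundary, the single-cell initial condition and both heuristics are assumptions for illustration, not the heuristics Team A proposed.

```python
import numpy as np


def run_ca(rule, width, steps):
    """Run an elementary CA from a single live cell with periodic boundaries.

    Returns an array of shape (steps + 1, width) holding every row.
    """
    # table[4*left + 2*center + right] is the next state of the center cell.
    table = np.array([(rule >> i) & 1 for i in range(8)], dtype=np.uint8)
    state = np.zeros(width, dtype=np.uint8)
    state[width // 2] = 1
    history = [state]
    for _ in range(steps):
        left, right = np.roll(state, 1), np.roll(state, -1)
        state = table[4 * left + 2 * state + right]
        history.append(state)
    return np.array(history)


def is_static(history):
    """Exclusion heuristic: the last two rows are identical."""
    return bool(np.array_equal(history[-1], history[-2]))


def distinct_rows(history):
    """Crude diversity heuristic: number of different rows visited."""
    return len({row.tobytes() for row in history})


for width in (21, 41, 81):
    history = run_ca(30, width, steps=width // 2)
    print(width, is_static(history), distinct_rows(history))
```

Running the same heuristics at increasing widths shows how they behave under scaling. A rule such as rule 0 fails the static-state check immediately, whereas rule 30 keeps producing new rows.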

Team C thought of open-endedness a bit differently, centering the question of surprise: what would allow a system to continually surprise us? They concluded that as long as we are thinking in terms of specific tasks and environments, it’s likely that even an open-ended agent might not be able to surprise us because we would have a tendency to bake in our own solutions and simplify away things which we did not think were relevant. But, if there were a space of environments, they hypothesized that one could look for things that would allow agents to increase their empowerment by making changes to the environment, and that that would be an indicator of situations that could support emergent structures such as tool usage. Team C went strongly in the direction that the ability to build tools to resolve different task bottlenecks could be seen as a very good indicator that the environment supports open-endedness.

Knowing what we could know but don’t

Another repeated theme was the idea that goals might derive from agents introspecting on their own learning process. In Team B, this was phrased as the question ‘What should you learn?’, with the idea of looking at agents empowered to choose which tasks to attempt and how their behavior would change with different levels of meta-information provided. Do agents do better when task meta-data is provided as input? How about if they’re explicitly informed about their own performance on past tasks, or their improvement on past tasks? The idea was that agents could learn to use this explicit meta-knowledge to not only better pick tasks they were likely to be able to learn to solve, but also to discover a natural curriculum for themselves.

Team F came to a similar conclusion through a different direction, looking for alternate ways to formulate the ‘action’ of picking a goal that would avoid a sort of laziness that plagues self-determined goal systems. The problem is fundamentally that if you get to choose your own goal, why not just pick the thing that gives the highest reward for doing what you’re already doing? They proposed a system using three modules: a goal-conditioned policy network which attempts to execute goals, a module that predicts whether or not a given goal can be successfully accomplished given the agent’s current state of learning, and a module which predicts whether that goal could be accomplished after further training. The agent would be motivated to choose goals that maximize the contrast between how well it could do given training, and how well it thinks it could currently do. In other words, the agent would get bored with things it knows that it knows how to do, but also wouldn’t be interested in things that it thinks are impossible.
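The goal-selection rule described by Team F can be sketched in a few lines. Here the two predictor modules are replaced by fixed, made-up probability tables, and the goal names and the function `choose_goal` are hypothetical; in the proposed system both predictions would come from learned modules.

```python
def choose_goal(goals, p_now, p_after_training):
    """Pick the goal with the largest gap between trained and current success.

    p_now[g]: predicted success probability at the current state of learning.
    p_after_training[g]: predicted success probability after further training.
    """
    return max(goals, key=lambda g: p_after_training[g] - p_now[g])


goals = ["walk", "juggle", "fly unaided"]
p_now = {"walk": 0.95, "juggle": 0.10, "fly unaided": 0.00}
p_after_training = {"walk": 0.99, "juggle": 0.80, "fly unaided": 0.01}

print(choose_goal(goals, p_now, p_after_training))
```

The already mastered goal and the impossible goal both have a small gap, so neither is selected. This is the sense in which the agent gets bored with what it already knows and ignores what it believes it cannot learn.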

Team J noted that there are a lot of frameworks for self-proposed goals, such as instrumental sub-goals arising from empowerment maximization, or things such as Schmidhuber’s PowerPlay or Learning Progress (which itself is similar to Team F’s proposed method). However, these are often hard to generalize or evaluate over long horizons or in open-ended systems where the environment could change drastically over the course of a run. They proposed a world-model type of approach to learning a high-level representation of distinct goal states, coupled with a network that would predict whether the agent could attain a given state. The reward driving the agent would be the inverse of the probability predicted by this network, pushing the agent to try to attain states that it considered a priori unlikely to be achievable.

There’s a common thread here that in some sense it’s the turning of the learning process towards its own ends which might underlie AI open-endedness, in a variety of potential forms.

Environments are other people

Another common thread among the groups was the idea of relying on the environment to provide the impetus to choose goals in pursuit of open-endedness. In some cases, this was born out of necessity due to the difficulty of an environment designer avoiding baking in specific bounded solutions (Team C), whereas in others there was the thought that environments composed of other agents, with agents determining goals for each other, could prevent the collapse to triviality and provide emergent drivers (Team F).

Team C’s take on environments was to ask what it would take to create environments supportive of tool use without baking in the definitions of those tools by hand. They considered the possibility of developing a metric for ‘compositional complexity’ that would measure how well the space of possibilities within an environment could be expanded by putting sub-elements of that environment together, creating a sort of empowerment motivation which would be used not for the agents, but to invent the environments in which those agents would be trained.

In Team F there was an idea that perhaps things extending beyond a single agent were fundamentally necessary to specify open-endedness, and that no self-only objective could do the trick. Instead, the team considered ‘distributional’ objectives: comparing the self to other agents, trying to maximize diversity at a society level, or trying to stand out and garner the attention of other agents. These sorts of approaches centered the idea of the other agents in a social context acting in some sense as an environment for each other, with society-scale forces such as Watsonian Natural Induction driving the open-ended selection of diverse goals. For the group, this felt in some ways more true to the experience of an artist than the alternate idea of seeking out learning opportunities, and the idea of a ‘social media likes game’ was raised as a possible implementation of this view.

A key thought with regard to the emergence of higher-level goals was whether it might be necessary in general for agents to be faced with goals that were inherently bigger than themselves – not just that the agent wanted to achieve a certain state, but that the agent wanted other agents to achieve some states, leading to a cascade of larger and larger coordinating groups. Team F also hypothesized that such “goals bigger than themselves” might be brought about simply through the minimal criterion of survival/reproduction, via the tension between collaboration and competition: in an environment where many agents compete for a limited carrying capacity, those who form collaborative groups will outcompete the individual agents, but to form stable collaborative groups they need to develop mechanisms which suppress within-group competition. Once such a mechanism exists, the only way to forward your own interest is to forward the interest of the collective, thus having a goal that is larger than yourself.

Ending word

As always, when people get together to talk about open-endedness, there is a diverse set of definitions and goals among the participants. But despite that, there was a sense of the importance of advanced awareness of ‘what might be possible’ across what the various groups said – what is possible for an environment, a rule, or an agent? Perhaps this meta-cognitive awareness ties in at a fundamental level to the need for a pursuit of open-endedness to anticipate and guarantee things that cannot be yet known.

Day 4: “Manual Badger” VS “Automatic Badger” approach


This day featured discussions regarding the differences between possible approaches to building collective learning systems, with a particular focus on the manual and automatic views on developing the Badger architecture.

Difference between Manual and Automatic Badger:

  • both are multi-agent systems; therefore, social learning is in play
  • in Manual Badger, no outer loop that learns the expert policy exists
  • the expert policy is handcrafted, and there’s a never-ending inner loop

One can think about the difference as ‘Cellular Automata with somewhat fixed update rule’ (Manual Badger) versus ‘Neural Cellular Automata’ (Automatic Badger). A specific instance of Manual Badger is what we call Memetic Badger.

Memetic Badger – a simplified description:

  • inspired by how memes spread in society
  • Cellular Automata with fixed update rule, but which can be dynamically reprogrammed by memes uploaded into it
  • each cell can store a limited number of memes
  • memes inside the cell vote for which new meme can get in/out
  • memes want to reproduce (spread, propagate)
  • we hypothesize that this is a system with evolutionary dynamics (selection, variation, and heredity), and it would evolve in an open-ended manner, not needing outer-loop optimization, just memes competing for limited resources (cells)

Useful analogy: meme is a program, cell/expert is a CPU/RAM

The Manual Badger approach forces us to closely examine what is happening in the inner loop instead of the outer loop optimization.


  1. Is the Manual approach (throwing away DL) a safer approach than Automatic? (faster, more interpretable, easier for humans to understand, etc.)
  2. What are our expectations about the length of the inner loop? Short, long, infinite?
  3. Any suggestions on literature and existing simulations (e.g. Avida, Tierra, etc.) and how to compare them with Memetic Badger?
  4. Please feel free to have any other unbiased questions!

The top questions asked:

  • Team A: Given agents with sufficiently powerful world models, what do we need to add to get memes / memetic evolution among them?
  • Team B: Can emergence and open-endedness be trained for via learning or meta-learning?
  • Team C: What are the proper inductive biases which make manual badger work? Could Badger benefit from an explicitly hierarchical/small-world-network connectivity?
  • Team D: How do we design things to scale?
  • Team J: Will automatic badger be always convergent to a converging expert policy and not divergent in the inner loop?

Summary of outcomes

Team A 

Question: Given agents with sufficiently powerful world models, what do we need to add to get memes / memetic evolution among them?

The team noted that agents can model the world and other agents too, doing so through a world model. Agents therefore observe events happening in their surroundings (including what other agents do) and this might be a source of interesting events. This way agents can learn interesting things by copying other agents’ behavior. This fact led the team to realize that a behavioral meme is not being spread deliberately, but just by one agent imitating another agent’s behavior.

The team suggested two hypotheses.

Hypothesis 1: Having agents with the ability to model their own learning process is a sufficient condition for meme emergence.

If an agent observes something in the world and concludes that this is something that it can learn to make happen, then it is possible for an agent to reproduce another agent’s behavior, and memes can spread. Copying per se is not necessary. The agent only needs to infer that things it observes are (or might be) things it can learn to do. Then it can figure out how to do them for itself. Imprecise copies are possible and they facilitate evolution of memes. Imprecision and even improvements are to be expected, since each agent is (at least partly) figuring out the behavior for itself rather than naïvely copying it. This property actually gives robustness to the memes – the meme’s survival does not depend on an exact copy, but it can be reconstructed from even an imprecise description / perception.

Hypothesis 2: Without the need of a world model, a copying meme will emerge even given very weak conditions – if the agent is able to act on its own and can be influenced by external actions.

Team B

Question:  Can emergence and open-endedness be trained for via learning or meta-learning?

The team discussed whether emergence, collectivity, open-endedness and self-organization could be learned. They noted that the emergence of interesting behavior is connected to manual badger and that it is important to understand if there are some inherent limitations to either the manual or the learned approach.

The team discussed the elements that differentiate a collective from a monolithic system:

  • Centralized vs Decentralized objectives
  • Monolithic system with different parameters everywhere -> diverse system
  • Introduction of any shared/common structure tends towards a population view
  • Directly optimizing a global objective vs indirectly via local objective (even oblivious to global)

This resulted in many new questions: is it possible to (learn to) optimize a global objective without individual agents being aware of it? Does this always happen via communication (direct or indirect)? How can we learn to generate global emergent objectives via local objectives only? Can local objectives alone lead to hierarchies?

The team also came to the realization that a collective is a set of modules with perturbed objectives, and that larger perturbations can lead to more interesting behavior (e.g. the objectives of fish are much less diverse than those of humans).

Team C 

Question: What are the proper inductive biases which make manual badger work? Could Badger benefit from an explicitly hierarchical/small-world-network connectivity?

The team discussed the importance of connectivity for finding collaborative solutions. This leads to the necessity to distinguish between memes and messages, concluding that memes are a union of language and messages. What forces the collaboration between experts (instead of each one solving it individually)? What makes it easier to find a collaborative solution than an individual solution? Interestingly enough, a recently published paper was brought up, noting that human babies have very blurry vision, which forces their brains to integrate global information instead of focusing on local textures (as convnets tend to do).

The team suggested several hypotheses regarding properties that might encourage collaboration, such as enforcing a small-world or hierarchical connectivity, relying on an efficient compositional language, or distributing the input information so that it must spread over the whole network in order to be integrated.

These hypotheses lead to possible experiments. Some of them focused on testing mechanisms that favor collaborative solutions, like randomly dropping out experts, disabling communication and gradual constraints on connectivity, either directly applied or indirectly via a loss function. They also suggested investigating the effect of different communication topologies (flat, hierarchical, random, etc.) and types (discrete, leveraging a predefined vocabulary, compositional, etc.), and how different ways of handling the memory would impact performance.
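As a rough illustration of how communication topologies could be compared before running any learning experiment, the sketch below builds three small expert graphs and measures the average number of hops between experts. The helper names and the use of mean hop count as a proxy for how quickly information can be integrated are assumptions made here for illustration; the team did not specify a metric.

```python
import random
from collections import deque
from typing import Dict, Set

Graph = Dict[int, Set[int]]


def ring(n: int) -> Graph:
    """Flat topology: each expert talks to its two nearest neighbours."""
    g: Graph = {i: set() for i in range(n)}
    for i in range(n):
        j = (i + 1) % n
        g[i].add(j)
        g[j].add(i)
    return g


def binary_tree(n: int) -> Graph:
    """Hierarchical topology: expert i reports to parent (i - 1) // 2."""
    g: Graph = {i: set() for i in range(n)}
    for i in range(1, n):
        parent = (i - 1) // 2
        g[i].add(parent)
        g[parent].add(i)
    return g


def add_shortcuts(g: Graph, m: int, rng: random.Random) -> Graph:
    """Small-world-style variant: copy g and add m random long-range links."""
    out: Graph = {i: set(nb) for i, nb in g.items()}
    nodes = list(out)
    added = 0
    while added < m:
        a, b = rng.sample(nodes, 2)
        if b not in out[a]:
            out[a].add(b)
            out[b].add(a)
            added += 1
    return out


def mean_hops(g: Graph) -> float:
    """Average shortest-path length over all reachable ordered pairs (BFS)."""
    total = pairs = 0
    for source in g:
        dist = {source: 0}
        queue = deque([source])
        while queue:
            u = queue.popleft()
            for v in g[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        total += sum(dist.values())
        pairs += len(dist) - 1
    return total / pairs


rng = random.Random(0)
n_experts = 31
flat = ring(n_experts)
hops = {
    "ring": mean_hops(flat),
    "small_world": mean_hops(add_shortcuts(flat, 6, rng)),
    "tree": mean_hops(binary_tree(n_experts)),
}
for name, value in hops.items():
    print(f"{name}: {value:.2f} hops on average")
```

The same graphs could then be used to decide which experts may exchange messages in an experiment, with expert dropout or connectivity constraints applied on top.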

Team D 

Question: How do we design things to scale?

In some cases we can prove scaling by construction, e.g. random search can eventually solve any finite problem. However, this may not be the kind of scaling that matters. More complicated senses of ‘scaling’ are of practical importance, for example solving things with a polynomial resource cost in terms of some reasonable measure of problem size or difficulty. These things are hard to prove or build towards in general, so we thought of empirical ways to approach them. One idea of practical value was to find some sort of correlate or probe of scaling, since directly measuring scaling is exponentially expensive.

To this end, we thought of e.g. whether different amounts of noise in an ML problem would map to different sizes of noise-free problems in terms of the parameter count needed to solve them, enabling one to estimate whether an approach could solve very large problems without having to actually go there.

We also thought about the source of the sort of efficient scaling of methods to problems. Often it arises from modularity, e.g. in discrete combinatorial optimization you need to be able to break things down into sub-problems. This seems as though it might be related to having an auxiliary direction where you can invest more resources into a solver to convert a problem into an easier one – this is kind of what we want the Badger experts to be. For example, in neural networks this is increasing the number of parameters in order to make the landscape more convex. In other problem domains, it seems as though we can find this too: solving spin glasses via replica exchange (adding extra phantom degrees of freedom, to make a manifold of related problems) or in constraint satisfaction by solving versions of the problem with some constraints omitted in order to narrow down the search space.

Team J 

Question: Will automatic badger be always convergent to a converging expert policy and not divergent in the inner loop?

Thinking about this question first led to the thought that an outer loop might not be necessary, and that having a small set of basic mechanisms in the inner loop might be sufficient. Inspired by evolutionary competition and natural induction, they proposed a series of candidate mechanisms and hypothesized that they could lead to complex and diverse behavior in the memetic badger.

One of these mechanisms was an extremely simple expert policy, which should install any incoming meme and let the memes decide, by vote and conditioned on the input, their eligibility for outputting an action and which meme to discard. Following the principles of natural induction, the expert would always discard a meme in favor of a new one, and it would select one randomly if there was no voting decision. They also noted that imposing limited resources was key, as it is what enforces search efficiency and enables behavior shaping. Another mechanism in this direction is giving distinct experts access to different resources, e.g. input, output and internal/processing experts.
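The discard half of this expert policy can be sketched in a few lines. This is only an illustration under stated assumptions: the `Meme` and `Expert` classes, the `vote_evict` interface, the requirement that meme names are unique, and the rule that a tie counts as "no voting decision" are all choices made here, not details given by the team. Voting on which meme outputs an action is left out.

```python
import random
from collections import Counter
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class Meme:
    name: str  # assumed unique within an expert
    # Hypothetical interface: given the expert's input and the resident memes,
    # return the name of the meme to evict, or None to abstain.
    vote_evict: Callable[[object, List["Meme"]], Optional[str]]


@dataclass
class Expert:
    capacity: int  # limited resource: how many memes fit in this expert
    memes: List[Meme] = field(default_factory=list)
    rng: random.Random = field(default_factory=lambda: random.Random(0))

    def receive(self, new_meme: Meme, observation: object) -> Optional[Meme]:
        """Always install the incoming meme; if full, evict one resident first."""
        evicted = None
        if len(self.memes) >= self.capacity:
            names = [m.name for m in self.memes]
            votes = [m.vote_evict(observation, self.memes) for m in self.memes]
            tally = Counter(v for v in votes if v in names)
            top = tally.most_common(2)
            if top and (len(top) == 1 or top[0][1] > top[1][1]):
                target = top[0][0]               # clear voting decision
            else:
                target = self.rng.choice(names)  # no decision: evict at random
            evicted = next(m for m in self.memes if m.name == target)
            self.memes.remove(evicted)
        self.memes.append(new_meme)
        return evicted


def abstain(observation: object, residents: List[Meme]) -> Optional[str]:
    return None


def against(name: str) -> Callable[[object, List[Meme]], Optional[str]]:
    return lambda observation, residents: name


expert = Expert(capacity=2)
expert.receive(Meme("a", against("b")), observation=None)
expert.receive(Meme("b", abstain), observation=None)
evicted = expert.receive(Meme("c", abstain), observation=None)
print(evicted.name, [m.name for m in expert.memes])
```

In this toy run the expert is full when meme "c" arrives, meme "a" casts the only vote, and so "b" is discarded to make room.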

They proposed to test the success of these mechanisms on a set of diverse but relatively simple tasks. Each task would be specified by a domain-specific language (DSL) and a compute base, with memes represented as code. Some essential behaviors, which are expected to happen during training, would be handcrafted (e.g. reproduction).

Ending word

A: Behavioral memes spread naturally, by observation and copy. Role of world models and self-modeling in learning, and how copying imprecisions lead to meme robustness.

B: Manual badger is important for knowing more about open-endedness and emergent behaviors. Collectives have some key differences with respect to monolithic systems.

C: Importance of connectivity for finding collaborative solutions. Noisy or partial local information leads to the integration of information at a global level. Exploration of different connectivity patterns.

D: Scaling can happen in a brute-force manner or in terms of compute complexity. Estimating whether an approach can solve very complex problems without having to actually do it. Efficient scaling often arises from modularity and breaking problems down into sub-problems. Investing more resources (initially) for converting a problem into an easier one e.g. increasing parameters for increasing landscape convexity. Solving simplified versions with omitted constraints (i.e. lateral thinking?) to narrow down the search space.

J: Manual badger might find success from a very small set of simple rules. Represent memes as code and let natural selection and induction do the rest, co-evolving more complex mechanisms. Selection (and behaviour shaping) through energy and expert cooperation as a way of survival maximization.

Some discussions centered on how manual badger is relevant for knowing more about open-endedness and emergent behaviours. Collectives have several key differences from monolithic systems, among them the ability to find collaborative solutions and the ease of scaling up the system. In this sense, connectivity is a very important factor and one of the teams was interested in exploring different connectivity patterns. They noted that having noisy or partial information at the local level leads to integration of information at the global level. An additional observation was that behavioural memes spread naturally through observation and imitation, if world models enable it, and that imprecisions lead to meme robustness.

However, scaling can happen in a brute-force manner or in terms of compute complexity, the latter being more interesting. One team discussed surrogate ways of estimating whether an approach would scale in complexity and how efficient scaling is often a byproduct of modularity and problem subdivision. They realized that some methods carry a certain computational overhead, but in exchange they are able to convert a problem into an easier one. One example is the increase of landscape convexity via increasing the number of parameters. Another team observed that one possible way of scaling manual badger might come through implementing very simple rules, representing memes as code and letting natural selection and induction do the rest.

Day 5: Open problems in (general) AI


The topic of day five was “Open Questions in General AI.” We find it instructive to revisit the question of what the major remaining challenges are and how to approach them. In particular, we want to understand the limitations of current methods and how they can be overcome. 

The discussions focused on several topics including the definition of AGI, metrics for measuring it, open-ended innovation, learning without reward functions, and efficient transmission of knowledge about abilities between agents. 

The top questions asked / problems stated:

  • Team A: Limits of engineering approaches
  • Team B: Beyond reward functions and goals
  • Team D: AGI – definition and difference to narrow AI
  • Team F: Memes: from behaviors to concepts 
  • Team J:  Metrics for measuring AGI

Summary of outcomes

Team A: Limits of engineering approaches 

Question: The ‘general’ in AGI extends to problems we don’t even know how to frame. Is there a fundamental limit to how far engineering approaches (benchmarks, quantitative evaluation) can take us, and how do we get past it?

Arguably, general AI does not have a well-defined metric. A way to get around it is exemplified by GANs: “Let’s throw out the principles of quantitative evaluation and create this system – if it works well, we’ll work on it more.” We could get to AGI by repeatedly applying this approach.

To be able to discover fundamentally new things and opportunities, we need a diversity of approaches and an openness to the detection of potential ‘door-opening states’: innovations that admit successful exaptation.

Team B: Beyond reward functions and goals

Question: Is there a way to move beyond reward functions? Can we make agents that “try” to do things, in any way other than telling them to maximize something?

A possible way forward may involve local, Hebbian-like rules that make the agent reproduce past behavior and allow the agent to compose new behaviors from existing ones. Other possibilities are emergent rewards, minimal criteria applied to behavior, and Hebbian-like learning with correlations propagated backwards in time.

Team D: AGI – definition and difference to narrow AI 

Question: What is the difference between general and narrow AI?

One possibility is that there is not a big difference between the two: a general AI could be just a big collection of narrow solutions and efficient learning algorithm(s). 

A possible notable difference between current, narrow AI systems and general AI is that general AI should be able to create an abstract representation of the situation, so that it is able to reason about it efficiently. Furthermore, in such a system the abstract representations should not be strict, but should be more flexible and context-dependent. Other limitations of current systems possibly lie in the lack of efficient memory mechanisms (that support lifelong learning), the lack of support for a sense of ordering, and the inability to make meaningful hierarchical representations. A need for communication in a group of agents could encourage the emergence of more structured latent representations.

Team F: Memes: from behaviors to concepts 

Question: How can we get from agents that can copy each other’s external behaviours (behavioural memes) to agents that copy concepts / internal “thought” processes? 

If agents are able to share knowledge about their abilities, this can greatly improve the speed of learning, since other agents could focus on sharing mainly the achievable goals, as opposed to random choice of new, possibly non-achievable tasks.

Two main types of knowledge sharing were considered: sharing knowledge by demonstrating behaviors to others and using language to share information with other agents. Assuming the agents have a sufficiently expressive world model, in order to be able to receive a meme, the agent needs to:

  • be able to understand the ability that is being transmitted (either the message or behavior),
  • encode the concept of the ability in some latent format and set this as its own goal for a new sub-policy,
  • start learning a policy that tries to achieve the goal.

This is a sketch of a mechanism that allows the agent to transmit abilities as memes in society.

Team J: Metrics for measuring AGI 

Question: Which metrics do we have and what would be necessary to evaluate AGI?

AGI is difficult to define and we do not just want to restrict it to human-level intelligence. 

One way to measure the intelligence of an agent is to measure its ability to solve problems and generalize over various sets of problems. However, at some point humans will not be able to come up with new problems or even judge an agent’s performance. In those cases it might be helpful to define a problem’s difficulty by the degree to which other agents are able to solve that problem. 

Another way would be to let agents explain the problem to other agents and measure the effectiveness, i.e., length of the explanation. A more intelligent agent should have a better understanding/compression of the problem and should therefore be more efficient at explaining it. 
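Both ideas can be written down as simple scores. The sketch below is one possible reading, with assumptions stated plainly: difficulty is taken as one minus the fraction of a reference population that solves the problem, and explanation efficiency is taken as the character length of an explanation that is assumed to have worked. The function names and all data are hypothetical.

```python
from typing import Dict


def difficulty(solved_by: Dict[str, bool]) -> float:
    """Assumed definition: 1 minus the fraction of reference agents that solve the problem."""
    return 1.0 - sum(solved_by.values()) / len(solved_by)


def most_efficient_explainer(explanations: Dict[str, str]) -> str:
    """Return the agent whose (assumed successful) explanation is shortest."""
    return min(explanations, key=lambda agent: len(explanations[agent]))


# Toy results for four reference agents on two problems (illustrative only).
results = {
    "problem_1": {"a": True, "b": True, "c": True, "d": False},
    "problem_2": {"a": False, "b": True, "c": False, "d": False},
}
scores = {name: difficulty(outcome) for name, outcome in results.items()}

# Toy explanations of the same problem by two agents (illustrative only).
explanations = {
    "agent_x": "Sort the list, then take the middle element.",
    "agent_y": "Compare every pair of numbers, count how many are smaller, "
               "and pick the one with as many smaller as larger.",
}
best = most_efficient_explainer(explanations)
print(scores, best)
```

A practical version would also need to check that the listener actually solves the problem after hearing the explanation, since a short but useless explanation should not count as efficient.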

Ending word

One of the common topics is that the definitions of terms used here are vague (e.g. intelligence and general intelligence), therefore it’s hard to judge the progress in these directions. On the other hand, sometimes it’s good not to focus on the goal too much, exactly as the AI should not focus on optimizing one objective, but rather explore a wide variety of interesting behaviors in an open-ended manner. 
