GoodAI recently hosted a virtual workshop with a number of external collaborators to address some of the crucial open questions related to our Badger Architecture. The workshop was the second of its kind (you can read a report of the first workshop here) and many of the questions stemmed from that first event. These events were held online and included participants from across the world. Despite their private nature, if you are interested in the topic or in participating in future events of this type, do not hesitate to contact the GoodAI team.
After organizing the first workshop, we wanted to make the next round even more productive and focused, ideally ending up with concrete testable hypotheses, tasks, experimental designs, and architectures.
To achieve this, we started by putting together a questionnaire in which all participants had a chance to provide feedback on the last workshop and, based on the outcomes of the previous sessions, highlight the primary discussion topics of interest.
We decided to focus on only a few topics, with repeat sessions during the two days to give everyone a chance to discuss a topic, digest it, and follow up the next day. We encouraged many of the participants to actively prepare for the respective topics so that they could engage even more thoroughly and arrive maximally prepared. We ensured that each of the three identified topics had at least four active participants, who took it upon themselves to prepare discussion and background materials as well as guide and foster deep discussions during the workshop sessions.
After the workshop, the participants spent two weeks reflecting on the discussions and compiled a collection of experiments with testable hypotheses that will serve as a basis for further research into the Badger architecture and related concepts. Future blog posts will touch upon some of these topics and experiments.
During the first workshop, we identified a number of pressing questions and topics worthy of further discourse. From these, we distilled three topics to focus on during the second workshop:
- Topic 1: When is modularity, collectivity & multi-agentness beneficial?
- Topic 2: Something out of nothing?
- Topic 3: Challenges of the Inner Loop
Below, we outline the posed questions and summaries for each of the three topics. Each section also includes a link to a Notes document that contains a wealth of additional information in a less polished, yet potentially valuable, form.
Topic 1: When is modularity, collectivity & multi-agentness beneficial?
There seem to be at least two views on the topic of multi-agentness/collectivity in Badger, depending on whether we are talking about the potential of Badger vs. the technical details of how learning in Badger occurs, i.e. the why and the what vs. the how (Marr 1982, Hamrick & Mohamed 2020).
The why & what are primarily viewed via analogies to either biological systems, such as collections of neurons in the brain and their collective computation, or social systems in which groups of agents act collectively, as we humans do.
The how, on the other hand, is more technological: it concerns the substrate in which Badger lives in our simulations, e.g. a collection of recurrent or transformer modules connected together in some way (statically or dynamically), and the impact of this substrate on the ability to learn within it at all, let alone learn relevant complex dynamical structures such as learning algorithms themselves.
The above two views are potentially so different that they warrant separate discussions, e.g. a mathematical optimization view vs. a social science/economics view. Both views are very important, but until we are able to resolve the technical issues, the “dream” view might remain just that for some time.
- What are the benefits of distributedness generally and in the Badger architecture?
- Why does communication/information transfer work well in neural networks, but not necessarily in Badger?
- Where is the transition from monolithic systems into modular/multi-agent ones?
- How can we achieve the benefits of collective decision-making akin to the ones observed in the NASA experiment (Watson and Hall 1970)?
- How can hidden information games help?
- Can the benefits of distributedness be studied separately from the meta-learning part or not? If not, why?
- Similar to 1, but narrower: what is the benefit of dividing the computation into blocks, and how can we calculate the trade-off with respect to constricting communication between threads?
- Is there a market-like structure where agents profit even more when cooperation is successful (so that they try to convince others via communication) but can still profit from their own predictions if others can’t be convinced?
- What’s possible with heterogeneous goals/rewards?
The Dream (Why & What) – the potential of multi-agentness and collectivity
- There is a huge collection of relevant resources and literature from various fields: Neuroscience, Economics, Social Sciences, Biology, Cognitive Science, and Psychology
- Ideas from all of the above can be relevant and interesting, but they are not trivial to implement and might not always be directly applicable within a system like Badger
The Reality (How) – the substrate and its impact on learnability and achieving the “dream”
- As discussed at the last workshop, multi-agentness might be a necessary evil
- But can we find where it can help at the substrate level? There are many areas which can shed some light on this topic:
- Distributed Optimization, Ensembles, Niches, Herding, Swarm Intelligence, Distributed/Decentralized Artificial Intelligence, Distributed Problem Solving, Federated/Collaborative Learning, and others
- The power of the masses vs. the power of individuality (Krakauer et al. 2020)
- i.e. when do collective systems benefit from masses (e.g. ensembling) rather than useful fusion of disparate information by individuals with collectively beneficial information/knowledge (e.g. Watson and Hall 1970)
- We need to remember the difference between a multi-agent system and its properties and learning in multi-agent systems. Similarly, collective computation and collective learning are different things and most likely should be treated as such.
Despite the fact that a number of very interesting questions were posed (see the previous section), the discussion ended up focusing primarily on the following questions:
- Why focus on a modular system (as opposed to a monolithic one) and what are its benefits and drawbacks?
- How to show/test the benefits?
During the discourse, we identified the following benefits:
- Ability to solve tasks neural networks are not capable of
- For instance, tasks that have a variable size of inputs / outputs (e.g. modular robots)
- Good scaling properties hypothesis
- The resulting Badger network should scale well with the number of Experts available. After adding new Experts, we should see faster convergence or a more precise solution.
- Shared weights benefits
- Ability to use sparsely activated Experts. We assume the following benefits of allowing some Experts to stay inactive:
- The system is better at avoiding catastrophic forgetting (as in e.g. ANML).
- Experts will naturally specialize and form modular (disentangled) representations (as shown in beta-VAE, RIM) which generalize outside the training distribution well (as argued in e.g. Consciousness prior, RIM, Neural EM and Meta-transfer).
- Easier coordination between experts – for example less noise in the communication channel (as argued in the IC3Net vs. CommNet architectures).
- Activity of other Experts can be used as an auxiliary source of (input) signal ~ lateral context (e.g. Social influence paper)
- Better robustness of the whole system
- Since there should be some variability in the Expert policies (and outputs), the whole system should be more robust, similarly to ensemble learning approaches.
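The ensemble analogy behind this robustness claim can be illustrated with a minimal sketch; the Gaussian noise model and simple output averaging below are illustrative assumptions, not part of the Badger architecture:

```python
import random
import statistics

def expert_prediction(true_value, noise, rng):
    # Each Expert outputs the true value corrupted by independent noise.
    return true_value + rng.gauss(0.0, noise)

def ensemble_prediction(true_value, n_experts, noise, rng):
    # Averaging independent Expert outputs reduces the variance of the error.
    preds = [expert_prediction(true_value, noise, rng) for _ in range(n_experts)]
    return sum(preds) / n_experts

rng = random.Random(0)
true_value = 1.0
single_errors = [abs(expert_prediction(true_value, 1.0, rng) - true_value)
                 for _ in range(1000)]
ensemble_errors = [abs(ensemble_prediction(true_value, 16, 1.0, rng) - true_value)
                   for _ in range(1000)]
print(statistics.mean(single_errors), statistics.mean(ensemble_errors))
```

With independent errors, the ensemble's mean error shrinks roughly as 1/√n, which is the ensemble-learning effect referred to above.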
There are also drawbacks to a modular approach compared to a monolithic one. These include:
- Multi-agent reinforcement learning (MARL) approaches are usually used for solving multi-agent tasks
- Modular systems are less efficient than monolithic ones
- For instance, a network with holographic memory will have greater capacity than a modular one while using the same resources.
- More complicated to train
- It is hard to choose the right architectural biases (e.g. Expert size, communication topology, etc.) so that the system converges as well as a monolithic neural network.
- The shared-weights assumption might also cause some strong local optima/saddle points, which might make convergence harder.
At the end of the workshop, the following hypotheses were proposed to be tested:
- Assumptions about good properties of sparsely activated Experts
- Addressing the question of how to decompose the input space amongst Experts to allow modular processing
Topic 2: Something out of nothing?
The key point of this topic is to figure out how to take advantage of thinking time which is not tied to a specific task. There are various examples of this in human society and endeavors – mathematicians propose axioms and explore their consequences, programmers might write and improve algorithms independent of the actual production use of those methods, and can then share them with each other. Even when someone is playing a game such as chess, there is some benefit to spending more time thinking about the next move.
This is a bit of a paradox given the view of neural networks as performing statistical inference. In that view, everything hinges on the mutual information between the data the network receives about the world and the quantity the network wants to infer. From an information-theoretic point of view, spending longer thinking cannot increase this information.
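Formally, this is the data processing inequality: if deliberation only transforms the representation the network already holds, the chain from world to data to thought is Markov, and further processing cannot add information about the world.

```latex
% Markov chain: X (world) -> Y (data received) -> Z (result of thinking on Y)
% Data processing inequality:
I(X; Z) \le I(X; Y)
```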
How should we understand the benefits of thinking longer so that we can take advantage of it in our architectures?
- How to correspondingly discover new knowledge and algorithms in the inner loop?
- Internal thought process (computation), deliberation, feedback and System 2
- How does it relate to open-endedness, curiosity, generative processes, etc.?
- Inference vs Learning of new processes
- The isolated AI scientist / mathematician analogy
- Why are we able to do it?
- What is missing from MetaGenRL (setting multi-agentness aside)?
- Is planning a necessary component that enables creating “something out of nothing”? E.g. Monte Carlo tree search?
- (Maybe similar to 2?) What are the problems being solved by this (e.g. by math, language-driven thinking)? Is it the construction of new theories and the exploration of their consequences? Is it digitalization for the purpose of enabling long-distance / long-term communication with ourselves?
There are two potential resolutions to the paradox described above. One resolution is that the limitations of cognitive systems may be computational rather than informational – that is to say, there is some mutual information that could be accessed asymptotically, but under the computational limitations of the architecture this information is not immediately accessible. Instead, the representation of the information must be successively transformed in order to expose that hidden information. In this case, we can use the framework of ‘Usable Information’ (Xu et al. 2020) to understand how these quantities are transformed as successive computational resources are applied.
Another potential resolution is that the cognitive system is in fact generating information, but not information about the external world – rather, information about how it organizes the processes of its own cognition. An example of this would be that, given a very small amount of seed information (the specification of a cellular automaton, for example), you could generate a large number of visual patterns, then use those internally generated patterns to pre-train a vision system. However, doing this requires sampling from the space of patterns, and every time the system randomly samples, it is actually generating entropy. Another example would be a search or planning system, such as Monte Carlo tree search, applied to a game. The algorithm samples random playouts and thereby increases the information it has about something – it’s just that the ‘something’ isn’t the random variables that comprise the external world but rather the possible counterfactual ways that the game could play out.
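The cellular automaton example can be made concrete with a toy sketch; the specific rule (110), grid size, and the use of rows as candidate pre-training patterns are all illustrative assumptions:

```python
def step(cells, rule=110):
    # One update of an elementary cellular automaton (wrap-around edges).
    n = len(cells)
    return [
        (rule >> (cells[(i - 1) % n] * 4 + cells[i] * 2 + cells[(i + 1) % n])) & 1
        for i in range(n)
    ]

def generate_patterns(width=64, steps=32, rule=110):
    # A tiny seed (one live cell) unfolds into many distinct rows, which
    # could serve as internally generated training patterns.
    cells = [0] * width
    cells[width // 2] = 1
    rows = [cells]
    for _ in range(steps):
        cells = step(cells, rule)
        rows.append(cells)
    return rows

patterns = generate_patterns()
distinct = {tuple(row) for row in patterns}
print(len(distinct))  # many distinct patterns from a one-bit seed
```

The seed specification is a handful of bits, yet every random choice of rule or initial condition yields a different stream of patterns – the entropy is generated internally, not received from the world.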
As a result of the workshop discussion, we proposed a number of experiments to address these two hypotheses.
One is to investigate the process by which search-based program synthesis could be accelerated by internally posed toy-problems. The idea is that agents working out solutions to problems they pose themselves can develop sub-programs which are inherently useful (things like sorting algorithms, for example), exchange those sub-programs, and thereby see improvements on an external task without needing to interact with it. We should expect various kinds of scaling: scaling with the time spent self-training, scaling with the number of agents sharing their discoveries, and finally scaling of the time spent on search with regard to the task itself.
Another experiment is to formulate an ‘accessible entropy’ that measures the degree to which a limited computational family can extract information from a representation of a random variable independent of the task or terminal variable that it is being used to predict, and then see if we can train a module to successively increase this accessible entropy. Then, if we have such a module, do we see improvements in performance with the number of times this module is applied (e.g. the length of the thought process to organize the information)?
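A toy version of accessible information, with a deliberately tiny predictor family standing in for the computational limits (this is a simplification for illustration, not the estimator from Xu et al. 2020):

```python
import itertools

def best_single_feature_accuracy(samples):
    # "Accessible information" proxy: the best accuracy achievable by a
    # limited predictor family (look at one coordinate, optionally negate it).
    n_features = len(samples[0][0])
    best = 0.0
    for i in range(n_features):
        for flip in (False, True):
            correct = sum(1 for x, y in samples if (x[i] ^ flip) == y)
            best = max(best, correct / len(samples))
    return best

# Task: predict y = a XOR b from the raw representation x = (a, b).
raw = [((a, b), a ^ b) for a, b in itertools.product((0, 1), repeat=2)]

# One step of internal computation appends the XOR feature to the representation.
transformed = [((a, b, a ^ b), y) for (a, b), y in raw]

print(best_single_feature_accuracy(raw))          # chance level for this family
print(best_single_feature_accuracy(transformed))  # exposed by computation
```

The mutual information between representation and label never changes, but each computation step can raise what this restricted family can extract – exactly the quantity the proposed module would be trained to increase.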
A third experiment has to do with compressibility. If a cognitive system is generating new bits of entropy that are meaningful towards its performance (but don’t encode information about the outside world) then its hidden state should become less compressible over time as it thinks, where compressibility here is measured as the degree of lossy compression that can be applied before the performance drops by a certain amount.
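One way such a compressibility probe might look, assuming a uniform quantizer as the lossy compressor, a trivial sign-of-mean probe as the performance measure, and synthetic hidden states (all illustrative choices):

```python
import random

def quantize(state, levels):
    # Lossy-compress a hidden state by uniform quantization to `levels` bins.
    lo, hi = min(state), max(state)
    if hi == lo:
        return list(state)
    scale = (levels - 1) / (hi - lo)
    return [lo + round((v - lo) * scale) / scale for v in state]

def probe_accuracy(states, labels, levels):
    # Toy probe: classify by the sign of the mean of the (quantized) state.
    correct = 0
    for state, label in zip(states, labels):
        q = quantize(state, levels)
        correct += (1 if sum(q) / len(q) > 0 else 0) == label
    return correct / len(states)

def compressibility(states, labels, tolerance=0.05):
    # The fewest quantization levels whose accuracy stays within `tolerance`
    # of the uncompressed probe; fewer levels = more compressible state.
    baseline = probe_accuracy(states, labels, levels=1 << 16)
    for levels in (2, 4, 8, 16, 32):
        if baseline - probe_accuracy(states, labels, levels) <= tolerance:
            return levels
    return 1 << 16

rng = random.Random(0)
labels = [i % 2 for i in range(50)]
states = [[(1.0 if y else -1.0) + rng.gauss(0.0, 0.1) for _ in range(8)]
          for y in labels]
print(compressibility(states, labels))  # well-separated states compress heavily
```

The prediction is that as the system "thinks", this measured compressibility should decrease: the hidden state accumulates self-generated bits that matter for performance.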
During the discourse many other ideas and proposals were discussed. Details about these, as well as other useful snippets of information, can be found in the topic notes here.
Topic 3: Challenges of the Inner Loop
Inner loop learning remains a challenge. In this topic, we explored the motivations for learning in nested loops, how to control what gets learned in the inner loop, how to scale up inner loop learning, and related questions.
- How to force skill learning in the inner loop, rather than the outer loop?
- How to scale from a fixed size inner loop, to an open-ended never ending inner loop? What kind of inner/outer loss can test for this?
- Are hidden states/activations of a recurrent NN enough to feasibly allow for all the above?
- Is it really important where the skills were learned? Humans are not made from general experts; we are pre-booted with experts specialized to learn language, social interaction, vision, etc.
- Can SGD still work over millions of steps inside the inner loop? Probably not. Can this be solved via auxiliary losses?
- Can the above be achieved using auto-generated tasks, e.g. by POET?
- What is the importance of algorithmic choice on both loops?
- What is the relation of this topic to “Something out of nothing”? Do the inner loop problems limit the complexity of the experimentation done by the agent/Experts?
Questions on which the discussion primarily focused:
- What are the motivations for the outer loop/inner loop structure?
- How can we force skills to be learned in the inner loop?
- How do we scale up to long (millions of steps) inner loops?
- The need for inner loop learning may be related to computational bounds and to changes in task distributions; there are some analogies to human knowledge acquisition and to the need for deliberation (Topic 2).
- Bottlenecks can be applied to control where things are learned. There are a number of ways to do this. One is to withhold some types of information (such as sensory data) from the outer loop. Another is to provide the outer loop with such a wide range of tasks that it can only learn very general strategies.
- A highly diverse meta-learning task suite (e.g., with Omniglot, tiny ImageNet and MuJoCo) could be tried with small learning architectures.
- We could support very long inner loops if we only need to backpropagate a few steps on the winning path; we already do something like this with program synthesis. If ten out of a thousand modules are selected for the winning policy, we backpropagate through ten modules only.
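A minimal sketch of backpropagating only through the winning modules, with scalar modules, a hand-written gradient step, and an arbitrary scoring rule (all purely illustrative, not the actual Badger training procedure):

```python
def forward(modules, x, k):
    # Score every module cheaply, but combine only the top-k "winners";
    # only these will take part in the backward pass.
    scores = [m["score"](x) for m in modules]
    winners = sorted(range(len(modules)), key=lambda i: scores[i], reverse=True)[:k]
    y = sum(modules[i]["w"] * x for i in winners) / k
    return y, winners

def backward(modules, x, y, target, winners, lr=0.1):
    # The (half) squared-error gradient flows through the k selected
    # modules only, so the update cost is O(k), not O(len(modules)).
    err = y - target
    for i in winners:
        modules[i]["w"] -= lr * err * x / len(winners)
    return err * err

# 1000 modules, of which only 10 are ever selected and updated.
modules = [{"w": 0.0, "score": (lambda i: (lambda x: (i * x) % 7))(i)}
           for i in range(1000)]
x, target = 1.0, 3.0
y, winners = forward(modules, x, k=10)
first_loss = backward(modules, x, y, target, winners)
for _ in range(200):
    y, winners = forward(modules, x, k=10)
    last_loss = backward(modules, x, y, target, winners)
print(len(winners), first_loss > last_loss)
```

The loss falls while 990 of the 1000 modules are never touched by the backward pass, which is the cost saving that would make very long inner loops feasible.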
- We could train long inner loops using short inner loops – e.g. learning a general for-loop using iterations of a few steps.
- We may need some guarantees on convergence if we want to extend the inner loops beyond what was seen in training.
GoodAI: Olga Afanasjeva, Simon Andersson, Joe Davidson, Jan Feyereisl, Nicholas Guttenberg, Petr Hlubucek, Martin Poliak, Marek Rosa, Jaroslav Vitku
External Collaborators: Kai Arulkumaran (Imperial College London), Martin Biehl (Araya), Sergio Hernández Cerezo (HCSoft), Kevin Corder (University of Delaware), Guillem Duran Ballester (Fragile Tech), Petr Simanek (DataLab, CVUT), Zheng Tian (UCL), Olaf Witkowski (Cross Labs)