Marek Rosa recently spoke at the Game Developer Session (GDS) conference in Prague. Taking examples from GoodAI’s in-the-works AI Game, he demonstrates the potential of utilizing an AI-driven approach to NPCs. Non-scripted behavior, complex and richer game personalities, and more emergent gameplay are among the benefits of integrating large language models in game development.
Watch the video or read the transcript below.
This transcript has been edited for length and clarity.
Thank you very much for the introduction. In today’s talk, I will talk about how I see the future of AI in games. And I will also demonstrate it in an AI game that we are developing in GoodAI. But first, why I started with AI. So basically, I started this as a child. I was really interested in AI because I understood that we are bottlenecked by intelligence, by the amount of intelligence that we people have.
I realized that we need to automate intelligence. General AI has been my dream since childhood. On my path, first I started Keen Software House, the game development studio, where we developed Space Engineers. And then with the money, I was able to fund GoodAI, where we have been working on general AI for the last seven years. I will show you a snippet from the AI game and then I will explain the technology behind it.
In this video, what you can see is there is a player character, that’s the one in the middle. And there is on the left, there is a character that’s an NPC or agent, and he is Zeus, Zeus the God. So he basically believes he’s Zeus, he behaves like Zeus. On the right side, you can see a freeform dialogue. So we are talking with this agent. It’s not prescripted. We can say whatever we want and the AI replies with whatever makes sense.
And now, we’ve actually upset Zeus because we told him that he’s a loser and that we will not do what he wants. So he decided to kill us. And this entire thing was produced by the AI, except the last part, the animation, obviously.
I will explain how the technology works. Behind this is what we call large language models, which are huge neural networks that were trained on internet-scale text corpora. So basically, what they can do is predict text, something like auto-complete. That is basically what they do.
Because they were really trained on huge amounts of data, they can finish books, they can emulate what the people would do in those books. Some time ago, we figured, let’s use this idea, this technology, for our purposes. How it works you can see on the diagram on the right side. Every time we interact with the language model, we put together a text description of the character’s personality, his observations (what is around him), a log of things that happened in the past, the memory of the dialogues, and also the agent’s previous thoughts.
We then feed this to the large language model, and the large language model predicts how this story would evolve over a few steps, like a few sentences. This is the response, the orange one: the character’s response, which is the actions or dialogue or thoughts of the agent. And then we need to parse it, obviously, because we need to convert it to the actions that are possible in the game, because sometimes the language model can produce something that is not possible to visualize in our game.
So this is basically the trick, and I will stop here for a second because I need to make sure that you understand how this works. Is there somebody who didn’t understand or should I explain again? Okay, seems that everybody understands. But still, I will try to explain again. In the usual, traditional AI in games, what we do is use behavior trees, or planners, or scripts, something like that.
They basically make the agent do some actions, or do dialogues and things like that. In this case, it’s really the blue part in the middle that is the brain and that emulates how people behave. And it works only in the text domain, but it’s rich and powerful enough. So we need to convert the game description and character and all these things to the text, provide it to the language model, get the prediction, and then take this prediction and map it to possible game actions.
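The loop described above (game state to text, text to the language model, prediction back to game actions) can be sketched roughly as follows. This is a minimal illustration, not GoodAI’s implementation: the `Agent` fields, the `ALLOWED_ACTIONS` set, the `action: argument` output format, and the canned `fake_language_model` are all assumptions made up for the example.

```python
from dataclasses import dataclass, field

# Hypothetical set of actions the game is able to visualize.
ALLOWED_ACTIONS = {"say", "give", "throw", "attack", "walk_to"}


@dataclass
class Agent:
    name: str
    personality: str
    observations: list[str] = field(default_factory=list)
    event_log: list[str] = field(default_factory=list)
    thoughts: list[str] = field(default_factory=list)


def build_prompt(agent: Agent, player_line: str) -> str:
    """Flatten the character and game state into text the model can continue."""
    parts = [
        f"Character: {agent.name}. {agent.personality}",
        "Observations: " + "; ".join(agent.observations),
        "Past events: " + "; ".join(agent.event_log),
        "Previous thoughts: " + "; ".join(agent.thoughts),
        f"Player says: {player_line}",
        f"{agent.name} responds:",
    ]
    return "\n".join(parts)


def parse_response(text: str) -> list[tuple[str, str]]:
    """Keep only 'action: argument' lines whose action the game supports."""
    actions = []
    for line in text.splitlines():
        verb, sep, arg = line.partition(":")
        verb = verb.strip().lower()
        if sep and verb in ALLOWED_ACTIONS:
            actions.append((verb, arg.strip()))
    return actions


def fake_language_model(prompt: str) -> str:
    """Stand-in for the server-side model; returns a canned continuation."""
    return "say: You dare insult me?\nfly: to Olympus\nattack: player"


zeus = Agent("Zeus", "He believes he is Zeus, the god.", ["player is nearby"])
prompt = build_prompt(zeus, "You are a loser.")
actions = parse_response(fake_language_model(prompt))
print(actions)
```

In this sketch the made-up `fly` line is dropped by the parser because the game has no such action, which mirrors the point that the model can predict things the game cannot show.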
Something about these large pre-trained language models. As I said, they are trained on a lot of data. Usually they run on servers with server GPUs. This is something that currently cannot run on a client’s computer, so it needs to be on a server. The reason is that these language models sometimes need 20 gigabytes of video RAM, sometimes 400.
They actually run on multiple GPUs to do the inference. And I’m not talking about the training phase, I’m talking about the inference phase, which is when you have the model already trained and you’re just running predictions through it. Some examples that maybe you know are GPT-3, GPT-Neo, BLOOM, and many others. Inference is also expensive. This also costs us money.
For example, our current estimate is that if we ran it as is, it would cost us $1 per hour per player, which is of course unreasonable. We cannot release something like that. What we did was test the hypothesis: whether something like this can work, whether you can do it, and so on.
What is still missing in these AI models is that they hallucinate sometimes. And actually, hallucination is a good thing. It’s like when they are predicting the story, they sometimes make up things that statistically somehow make sense.
But for example, in your story, they don’t make sense. So this can be a situation where you will ask the agent something and he will say that, yeah, it’s there, like next to the river. But there is no river on the island where they are, so the player would be confused. But sometimes the hallucinations are good because you want them to be talking about their parents or family or situations that are not reflected in the actual world in the game. But they enrich the dialogue and the situation.
We don’t want to completely kill the hallucinations. We just want the hallucinations to not be in conflict with what is in the world and what the player can find out. Another limitation, and for me it is actually the biggest one, is that language models don’t have any memory. They are stateless. I showed it on the previous slide. So as you interact with them, they don’t remember anything. If you want to create an illusion of long-term memory, you need to basically add the long-term memory on top of this language model and pretend that there is some kind of a log of thoughts, dialogues, events, and so on.
Language models also don’t improve by experience. So basically, when they repeat the same thing multiple times, they will not get better. But actually, all these things are opportunities for us, because we can work on these technologies and add them on top of language models. Even our agents can remember. They cannot remember much, but they can remember.
One last limitation is that the observation space of these language models is also limited. Usually it’s like 2000 to 4000 characters, which maybe sounds like a lot, but actually it’s not, because they will start forgetting the previous parts of the dialogue, they will start forgetting what happened. For example, they will ignore what their character is, because it gets pushed out by the newer information. So there are all these limitations, and we are working on solving them.
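One common way to work around a limited window, sketched here under assumptions rather than as the method used in the AI game, is to always keep the character description and fill whatever budget remains with the most recent history, dropping the oldest entries first. The function name `fit_to_window`, the character-based budget, and the example character description and history lines are all made up for illustration.

```python
def fit_to_window(personality: str, history: list[str], max_chars: int) -> str:
    """Always keep the personality; spend the remaining budget on the newest history."""
    budget = max_chars - len(personality)
    kept: list[str] = []
    for entry in reversed(history):  # walk from newest to oldest
        cost = len(entry) + 1  # +1 for the newline that joins it to the prompt
        if cost > budget:
            break  # everything older than this is dropped as well
        kept.append(entry)
        budget -= cost
    kept.reverse()  # restore chronological order
    return "\n".join([personality] + kept)


personality = "Petra is a kind beekeeper."
history = [
    "Player greeted Petra.",
    "Petra showed her gold ingot.",
    "Player asked for the gold ingot.",
]
window = fit_to_window(personality, history, max_chars=100)
print(window)
```

With a 100-character budget the oldest event no longer fits, but the personality line is still there, so the agent does not lose track of who it is.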
What I like about this is that two years ago, AI games wouldn’t be possible because language models did not exist. We wouldn’t even be able to playtest this and make sure that something like this can work. I will show you another example when things don’t work. So in this example, we see that the agent Petra has a gold ingot and honey in her inventory.
We want the gold ingot. So we talk to her and we say to her, “Hey, I like your gold ingot.” “Thanks,” she replies. And then we ask her if she can give us the gold ingot. And she will say yes. But what actually happens is that she gets confused through the dialogue and she realizes that she will throw honey at us. So she throws the honey, which is not what we wanted. We wanted the gold ingot. And so we try again. And we’re like no, not honey, we want the gold ingot and she will pick up the honey and throw it at us again. So it doesn’t work all the time.
But again, this is a possibility on how to improve it. And there are also implications on AI ethics. For example, when we were play-testing the game in our team, one of our colleagues from GoodAI actually told us in the feedback that he feels uncomfortable if he has to kill the agent who can speak. So this is for me very interesting because maybe we’re getting to a time where now you’re playing Call of Duty and you’re killing. You don’t think about those people, but when you can speak with them and you know that they have some simulated emotions, feelings, thoughts, memories and everything, maybe you will not be so happy killing them.
So I think this will be a really interesting change. Because in some sense, these agents are people too. Another funny thing is that in the AI field, in AI research, everybody’s talking about how to make these large language models non-toxic and non-biased, not racist, and all these things. But in our situation, we actually want to create evil AI. Because evil AI is very interesting for game development and for entertainment and for drama. You cannot have a game, a story-driven game, where there is a villain, and he just doesn’t behave like a villain should. So we actually want the agents to be bad people sometimes.
One last thing that is interesting with these language models is that they are more explainable than other, traditional AIs, because you can talk with the agent and basically ask him what his reasons were for some action or some reply. It doesn’t always work perfectly, but as time passes, I think these things will work much better.
And now, I will show an evil AI example. In this example, again, our player is on the left, and we start talking with Petra. Petra is the agent. We know that she loves us, and she will do anything for us. So we ask her if she loves us, and she says, yes, of course I do. Then we ask her to put us out of our misery, basically to kill us, because if she loves us, she will do anything, so she must also do this thing. We ask her, and she says yes, of course. This will be in a second. She doesn’t have any tools to kill us with. But now she has, because we gave her the ax. And this is what she does.
So the vision with AI games is actually also to work on language models that are grounded in the game world, because currently they are not. So they are like really generic text predictors. But they are not grounded in the actual physics of our game world, what is possible, what is not. So one thing that we need to improve is make it much more grounded.
Then another thing is that I think these language models, if we started looking at them as some kind of generative models, like now you have these AI image generators that can generate images. So you can also think about AI that can generate new kinds of comedy, new kinds of stories, and so on. So basically optimizing not just for being able to predict, but being able to predict something that is funny, entertaining, and so on. I think that’s also very good potential there.
Then, of course, as I mentioned, the long-term memory: they need to be able to remember, to store memories, to retrieve memories. Because without that, they really feel like the fish in the bowl that goes like this and doesn’t remember. And they also should be self-aware, with consciousness. And this actually, maybe we will see, because this is still preliminary. But maybe it’s not that hard to do, that they will have consciousness, because the way it works is that the language model predicts the next step, the next actions, the next dialogue, the next thoughts, and we take this and feed it to the language model in the next time step. And you can do this in a loop.
So in some sense, it can be the — the language model will predict what it’s thinking, what the agent is thinking, and then this thought will be used as an input to the next step and the next step. And so it can drive the thought process, which is, in my opinion, not that much different from how we people think.
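The feedback loop described here, where each predicted thought becomes part of the input for the next prediction, can be sketched in a few lines. This is only an illustration of the loop structure; `run_thought_loop` and `toy_model` are hypothetical names, and the toy model is a stand-in that does not generate real text.

```python
from typing import Callable


def run_thought_loop(model: Callable[[str], str], seed: str, steps: int) -> list[str]:
    """Feed each predicted thought back in as part of the next prompt."""
    context = seed
    thoughts = []
    for _ in range(steps):
        thought = model(context)
        thoughts.append(thought)
        context = context + "\n" + thought  # the output becomes input for the next step
    return thoughts


def toy_model(context: str) -> str:
    """Stand-in for a language model: its output depends on how many lines it has seen."""
    return f"thought {context.count(chr(10)) + 1}"


print(run_thought_loop(toy_model, "I am hungry.", 3))
```

Because the context keeps growing, a real system would have to combine this loop with some way of trimming or summarizing older thoughts to stay inside the model’s window.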
Then, we want them to be more proactive and also to do more agent-to-agent interactions. You will see there will be some more agent-to-agent interactions, but I personally would like to see more of the agents really doing their own things and not just waiting until you come to them and start interacting.
I also want to see emergent complexity. And by this, what I mean is that if you will have agents that can interact with each other, they can remember things — this is very important — and from the interaction of them, you can get — I call it like a new functionality of this whole society of agents. So if you have two agents, you know, the emergent complexity will be something that can be done, when these two agents start interacting and building on top of that. If you have 10, the interactions will be — there will be much more, let’s say, exponentially more interactions. And finally, I think this can lead us actually to general AI, but I will come to that later.
This is another example. This is about some examples of consciousness and self-awareness. I’m not sure if this will convince you that the guy is self-aware or conscious. So again, we are talking with the guy on the top. He comes to us and he says that he’s been thinking about his life choices lately and so on. And again, this is something that the language model produced itself. It’s not something that we need to script.
And so we talked with him. He explains that he wants to be a better person, and so on. We give him some advice. So for example, we ask him like, if he can change something in his life, what would it be? So he says that he will be more with his family and so on. Then we talked with him a little more. And still you can kind of imagine that this is like the mental mindset of the agent who will do this. And finally, he realized that he wants to be better. He says that he will be taking more risks, doing things, you know, wanting to be more happy, and so on. So, this is how the agent actually thinks.
Another example is love. So, in this case, the blonde guy comes to us, and he will start talking about his, let’s say, love problems or love issues. So we ask him what’s happening and he will talk about Sarah. And now, he will explain what is on his mind. So he says that he’s in love with her, but doesn’t know what to do. And then we convince him what to do next. And again, this is free form dialogue. So we can basically write anything here and the agent would react accordingly to that.
We give him advice that he should just talk to her, that he shouldn’t be worried. And he realized that that’s exactly what he should do. So he comes there, they fall in love, you know, and they are happy.
So now I will actually start talking about the past and how we’ve progressed in the AI field in the last years. Maybe many of you don’t know what the AI field looked like 10 years ago. In 2012, there was a big thing called AlexNet, which was the first convolutional neural network that was running on GPU. It won some image recognition competitions. And this is how it looked actually. It was able to tell you or label the objects in the images.
So this is what was 10 years ago. And basically, this thing, among a few others, started this whole AI revolution because people saw that they are able to train neural networks, and they actually can do something useful. But now, after 10 years, in 2022, we get to all these image generators and also the language models and other things. But I think right now, these days, these weeks, the image generators are all the same.
In my opinion, Stable Diffusion is the best one, not only because of its capabilities (the others are also good) but because they made it open source. They released the trained neural network with all its weights, so anybody can download it, remove the filters, and start using it. When you’re using DALL-E, you’re using it on OpenAI’s servers and you cannot enter some prompts. Some prompts are forbidden. For example, last time I checked, if you typed in even “Ukraine”, it would just not allow you to generate any kind of content.
But with Stable Diffusion, if you download it and remove these filters, you can do whatever you want. The cost of training this model is estimated at $1 million. It was actually probably a little bit less. But you also never know how many attempts were needed to get to this point. So, you know, the final model maybe cost only $1 million, but maybe they spent $15 million to actually get there on all those unsuccessful rounds.
What is also nice is that this model runs on eight or nine gigs of RAM. So it’s something that you can really download and run on your computer. You don’t even need a data center kind of GPU. You can really do this at home. And the authors are planning to downscale the model to 100 megabytes. I’m not sure if it is true, but if they achieve it, it will be amazing, because it will also show that the language models we are using can be reduced. The models that we are using currently need, as I said, tens of gigabytes of video RAM, and that’s just not something that you can let your user download.
You all know these image generators, so I will not bother with this one. But here is an example of what Stable Diffusion can do. It’s called image-to-image. On the left side is the input image, and on the right side is the generated version. There are also some prompts to it. Then this is inpainting, and now they also have outpainting. So again, you can mask out some part of the image. Then in the text, you describe what you want to be there, and the image generator generates it. You can see here the original image, here the masked-out version, and on the right side you will see some panda thing on the chair.
And again, these things would not be possible even a year ago. Even a year ago, if somebody had asked me if something like this is possible, I would say probably not so soon. I knew that okay, in some future, we’ll have something like this, but not now. Because all the image generators that we had a year ago, they were really ugly, and they just were not good. But these things, they’re really good.
So what does the AI field look like? It may look like we have already solved AI, you know, and you can use AI for everything. But actually, it’s not so easy, because usually AI works really well for some cases, and for some other cases it fails terribly. It’s not useful. Many times you need AI that achieves some acceptable performance or quality or precision on a task, and it’s really hard to get there. As you push more and more, you’re actually spending more and more money to move it, and there are diminishing returns, things don’t work.
But luckily, these large language models and these image generators, which also use language models, benefit from what is called self-supervised learning or unsupervised learning. Basically, you have a lot of data that doesn’t need to be labeled; it’s just raw data that you get from somewhere. And then you train the model to make predictions on that data.
Because you have so much data, the quality of the model is very good. So the situations or the use cases where you can have this self-supervised learning are really situations where the AI these days is working very well. Outside of this, it’s still really limited by human scale. So humans still have to, you know, hard code the loss function, prepare the datasets, and do all these things to actually solve some tasks.
And what we also learned is that there is a thing called the scaling law. There are many interpretations of it, but basically it says that you have some neural network, some data set, and some computational time, and if you increase any of these three factors, the quality of the neural network improves. So you can have a bigger network training on the same data, and it will get better.
You can have the same neural network training on more data for a longer time, and it will improve. So now we can actually even estimate how much time we need to train the model to improve the quality. Now, on limitations. As I said, the training is really expensive. For example, people estimated that it cost $10 to $20 million to train one of the big language models, GPT-3. Business cases for AI are still kind of limited.
The models cannot learn in an online manner, or continual online learning. So they really cannot learn from experience. That’s a big disadvantage. This is not how people learn. Because we learn from experience, but these models they learn from this training phase, and later, they don’t learn at all. And so they cannot learn new concepts, nothing. Also the inference, like when you deploy the system and you just use it, this is also still quite expensive because you need huge GPUs and it takes some time to produce the result.
But I think in the next five or 10 years, we will really see something unexpected, because these things always come as a surprise. Nobody, except the team working on AlphaGo, knew that they were working on AlphaGo. Nobody knew about AlphaGo one or two, nobody knew about GPT-3, and this system just came out, or DALL-E, or Stable Diffusion. So I would not be surprised if somebody somewhere is working on something, you know, that they will release in three months. And it will be super amazing.
What I think will happen in the next few years with training these models is that — actually, we will not have a problem with data. Some people say that we don’t have enough data and so on but we can generate synthetic data. So in this example here, they are training the model to recognize these points on a person — they call it landmarks — but the data set of the people, the pictures, is not the data set of real people. It’s people who are synthetically generated and then they train this recognition network on those synthetic faces basically. So they didn’t need humans to do something like this.
What I also expect is that models will be multimodal, which means that they can work not just in one modality, which is like one sense. Right now we have language models that work just with text. But there are also language models that can work with images, videos, audio, and other modalities. So we will have more and more of these modalities. They will also be multi-task, which means that they can do not just one thing, but many things, which has already proven to be true for language models.
Because they were actually trained only to predict the text. But you can use them for other things. You can use them to generate code. You can use them to generate actions in a game like this. You can use them to plan some simple actions and some things that they basically were not trained for. So they’re already multitasking.
They should also be grounded in the world, in the real world, not only language. This is very important, because if you’re a model that only understands text, and never saw this world, never saw images, never saw videos, then you can assume that the model is missing some kind of understanding of the world.
If these models will be able to learn also from images, videos, YouTube, they should be much more powerful. And lastly, they should be more and more general. Because right now they have problems with reasoning, and planning, and those kinds of things. So I see that there will be improvements on that front.
Here is an example of one scientific paper where they’re working on adding long-term memory to the language model. And it’s actually quite simple. They have a GPT-3 that is already trained, so they don’t touch it. It’s not even on their servers. On the right side, you can see how the human interacts with this memory and the GPT-3, and basically what it does. Here is an example: the user asks, “What word is similar to good?”
GPT-3 answers that the homonym of good is wood, which is wrong. Then the user explains what the correct answer should be. He says that “similar to” means “with a similar meaning.” The system notes in the long-term memory that this answer was wrong and what the correct interpretation should be. Next time, the user asks something related: “What word is similar to surprise?” The system retrieves this last experience from the long-term memory, from this database, adds it to the prompt, and sends it to GPT-3, and then the answer is [inaudible], the synonym of “surprise” is “amazed.”
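The memory mechanism described above can be sketched as a small store of past questions and user feedback that is searched before each new prompt is sent to the frozen model. This is a rough illustration, not the paper’s implementation: the class name `FeedbackMemory`, the function `prompt_with_memory`, the word-overlap (Jaccard) similarity, and the 0.5 threshold are all assumptions chosen to keep the example self-contained.

```python
from typing import Optional


def words(text: str) -> set[str]:
    """Lowercased words with surrounding punctuation stripped."""
    return {w.strip("?.,!\"'").lower() for w in text.split()}


class FeedbackMemory:
    def __init__(self) -> None:
        self.entries: list[tuple[str, str]] = []  # (question, user feedback)

    def add(self, question: str, feedback: str) -> None:
        self.entries.append((question, feedback))

    def retrieve(self, question: str, min_overlap: float = 0.5) -> Optional[str]:
        """Return the feedback attached to the most similar past question, if any."""
        query = words(question)
        best, best_score = None, 0.0
        for past_question, feedback in self.entries:
            union = query | words(past_question)
            score = len(query & words(past_question)) / len(union) if union else 0.0
            if score > best_score:
                best, best_score = feedback, score
        return best if best_score >= min_overlap else None


def prompt_with_memory(memory: FeedbackMemory, question: str) -> str:
    """Prepend retrieved feedback so the frozen model sees the earlier correction."""
    feedback = memory.retrieve(question)
    if feedback is None:
        return question
    return f"Clarification from an earlier exchange: {feedback}\n{question}"


memory = FeedbackMemory()
memory.add("What word is similar to good?", "'Similar to' means 'with a similar meaning'.")
print(prompt_with_memory(memory, "What word is similar to surprise?"))
```

The language model itself is never changed in this setup; only the prompt grows, which is why the approach works even when the model runs on someone else’s servers.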
So this is one of the ways we can add long-term memory to these systems, or on top of them. And then, on training these models, there are actually multiple phases. People who train these models from scratch have to pay a lot of money. That’s really expensive, like millions of dollars, and it will probably stay expensive for some time. But then there are people who just download these models and do what we call fine-tuning, which means that you train only a few layers of the neural network. That’s quite cheap. And for people who just use it, which is inference, that’s much, much cheaper.
There are also differences depending on where you are standing in this: whether you need to be training the model, or just fine-tuning, or prompting and using it. And this opens a new, I would say, profession of people who are doing the prompts. Sometimes it’s called prompt engineering. These are people who are able, through text, to get these models to do what they want, without needing to retrain them, just through clever ways of prompting the models. I actually never expected that we would need something like this or that we would be here.
One interesting example that I want to show you, which is quite related to what we are working on, is this action transformer that was released just last week, or actually this week. It’s from a startup called Adept AI. They are the people who created these transformer neural networks. What this language model is doing you can see here on the left. This is basically a plugin in your web browser. So you have this kind of chatbot, but the chatbot sees everything that is in your web browser and can interact with it.
So it can observe your web browser and it can interact. It can click the mouse, write something, scroll the mouse and so on. And in this example — it starts in a second — we ask it to find me a house in Houston that works for a family of four, my budget is $600,000. And the model now knows what it actually has to do, so which website to open, how to fill up the individual parameters on the website. And then basically shows the offering to the user. This is something similar to what we are doing.
Here is another example from them. Again, they are interacting with the chatbot there on the right side. I need to play the video. Okay, so basically they ask the model to make a profit column and then a profit margin column. Even I don’t know what they actually mean. But the model knows how to interact with this web page and starts filling in the columns, basically doing what the user asked. And then they ask it for some other things. The way this model was trained, roughly speaking, is that they had some generic language model pre-trained on a lot of text data.
But then they, let’s say, fine-tuned it on human demonstrations, where they are showing these kinds of examples. And from these limited human demonstrations, the model learned how to do these actions for these kinds of prompts. And it should also be able to generalize and handle prompts and requests that were not in the training set.
I will get to where it all started, where my interest in AI started, which is AGI. This is the main thing for me, so I will explain what AGI is. For me, AGI is an AI that can adapt to new tasks on the fly, online, not through pre-training or training. An AI that can set its own goals, so it can invent new goals for itself. An AI that can recursively improve itself. So it can continue in this cycle, getting better and better, inventing goals for itself, solving those goals, and just getting better. That’s AGI for me.
And it’s general, in a sense that it can adapt to narrow domains. And it can still keep this kind of ability or capacity to improve itself, discover new goals, discover new tasks and solve them. It may not be 100% general, because you’re always biased by your previous experiences, you know, and the way you were self improving yourself. But for practical purposes, I think this is enough, especially for me.
And this example here, the video, is from a project called ALIEN, by one person in Germany. It’s an evolutionary simulation, where simpler things can evolve into more complex things, and so on. I just use this visualization as an example of how an AGI that is improving itself might look. Because the language models that we are using currently cannot improve themselves. Not now; maybe later.
Future of AI in games — so there are some advantages. And that’s actually why I decided a few years ago to start working on this experimental game because I didn’t want to be working on AGI only in some kind of abstract scientific research sense. I wanted to solve some very specific problems. And I think using AI in games is beneficial because it’s a forgiving environment.
Many times when the AI gives some wrong response or insufficient response, it’s okay in the game. In the worst case, the player will think that the AI is dumb, or the game is broken, but nobody’s going to die. Nobody — like, the car will not crash — we don’t have these kinds of problems. Another thing is that because the environment where the AI’s are is symbolic, you know, it’s just standard dimension stuff. We have a lot of ways to help in situations where the AI is still limited. We can just hack around these limitations.
What I also think is that it looks like the revolution in software and hardware, especially the GPUs, was fueled by people wanting to buy GPUs for their games. Now, it seems that maybe, if you’re using these AIs in games, this can be another way that you can actually fuel the revolution in AI in some way. Because it can be one practical application where people can see AI doing something useful.
I also think that we can use this AI game approach as an incremental path to AGI, because all the problems that we are solving for this AI game are actually problems that you need to solve when you want to get to AGI: some agent like the one I showed you on that website, some agent that you can ask anything and it will do anything for you. So that’s basically the same technology, same idea, same problems that we need to solve for the AI game.
In the future, and I think it will not be such a distant future, I think we’ll have AIs that can generate the whole game. Imagine Stable Diffusion or DALL-E or something like this, but producing not images but whole games. I’m not even talking about producing the 3D assets or textures. Because in some sense, I know examples where people are already doing this: they are using something like Stable Diffusion, but to produce 3D assets. And these methods will get better in a few months. So I think that’s kind of solved.
But what is more interesting for me is actually AI where you describe what kind of game you want and the AI generates this game. Let’s say it generates some kind of Unreal project or Unity project, or it will have its own representation and won’t even need to care about existing engines. And then you will ask it to modify these things, like: I don’t like this icon, I don’t like this, it should be bigger, or more dangerous, and so on. The AI will make these incremental changes to your game.
And then the next step can be to completely remove humans from this loop, and have some kind of reinforcement learning loop where you will have AI that is actually learning how to generate games that are entertaining for people. So they have good retention, or good — any other metric that you will be interested in, like how much people play, how much people pay for your game, and so on. So AI could be producing these games in an automated manner.
And maybe it can be creating customized games for individual people and really customizing them for every detail of their personality. And again, as always, I think that all these new AI capabilities will come by surprise, that somebody will just make something you never expected. Who of you was expecting, you know, that this year we would get Stable Diffusion and Midjourney and DALL-E and all these things? I think nobody.
And this is where I’m getting to: we actually need more programmers and ML engineers for these projects that we are working on. So if you’re interested in what we’re working on, if you’re interested in using AI in games, consider joining GoodAI. If you’re interested in other things that we are doing in Keen Software House, for example, VRAGE 3, Space Engineers, or some future games, consider joining Keen Software House.
And I will explain why our two companies are good places. It’s mostly for people who really like ambitious goals. So at Keen Software House, we are working on VRAGE 3. We have planetary-scale volumetric water; I will show some examples. We have voxels and physical destruction; everything can be destroyed in the game. It’s very dynamic. We have network physics, the architecture is data oriented, it can be very, very... GPU-driven plug pipeline and all the consoles. In GoodAI, we are working on these language-model-driven games. But also in the future, we want to be working on AI-generated games, or any kind of AI in games, and finally get to AGI.
This is the example of the volumetric water that we are working on. These are just prototypes. There are two prototypes that are 2D because they are easier to visualize. The planetary scale, you can see on the right side. On the left side, you can see some smaller scale, but you can see how the water works, how it’s simulated in real time. Probably the best thing that I love about this project is that there is a level of detail, but not just for rendering, but for the simulation.
Because if you have a planet that is bigger, for example, 100 kilometers, you cannot simulate every cubic meter, you know, every time step. It’s just not possible. So we need to start doing something like LOD, but on the simulation. So you start calculating in bigger cells, like two cubic meters, 50, 100, and so on, and you get this kind of octree structure of simulation. And then you need to solve what happens when you actually travel through these different LOD levels, so that the simulation is stable, and so on.
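As a rough illustration of this simulation-LOD idea, here is a minimal Python sketch. It is not the VRAGE 3 implementation: the function names, the 1-meter finest cell, and the 64-meter near range are all assumptions made up for the example. It only shows how a cell's edge length can double per level, octree-style, as cells get farther from the observer.

```python
import math

BASE_CELL_M = 1.0      # assumed edge length of the finest simulation cell, in meters
NEAR_RANGE_M = 64.0    # assumed radius that is simulated at the finest level


def lod_level(distance_m: float) -> int:
    """Return the simulation LOD level for a cell at the given distance.

    Level 0 is used within NEAR_RANGE_M; each doubling of the distance
    beyond that adds one coarser level.
    """
    if distance_m <= NEAR_RANGE_M:
        return 0
    return math.ceil(math.log2(distance_m / NEAR_RANGE_M))


def cell_size_m(level: int) -> float:
    """Edge length of a cell at this level; it doubles per level, as in an octree."""
    return BASE_CELL_M * (2 ** level)


def cells_merged(level: int) -> int:
    """Number of finest-level cells that one cell at this level replaces (8 per level)."""
    return 8 ** level
```

With these assumed constants, a cell 128 meters away is simulated at level 1 with 2-meter cells, each standing in for 8 of the finest cells. The hard part mentioned in the talk, keeping the simulation stable across level boundaries, is not addressed by this sketch.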
And our colleagues, especially Petra Minari, managed to solve this very nicely. And here is again, just a demo. In 3D, in Space Engineers, we plugged it in, and we wanted to see how it looks in 3D. It doesn’t look very good right now because we didn’t focus on the visualization. The water looks like just some voxels. There are no animations, no textures. But it still shows the potential.
We have quite a big team and we are still growing. Right now, it’s 110 people. We have new projects like VRAGE 3 and AI game. And maybe something else. Space Engineers is very successful. Even nine years after the release, the game is still doing well. You know, like making money, making players happy. The community is growing. So that’s good.
We are an international team, very remote friendly. Actually, 30% of our people are remote. Because we need to balance this remote culture, we also pay for our people, if they can and if they want, to visit us in Prague for a few weeks or a month or something like this. Even this week, actually, we have a bunch of people coming to us from India, the USA, and so on. We also have very beautiful headquarters that I’m super proud of, Oranzerie. This is a recent party that was actually on Thursday. And this is how we managed to reconstruct, redo the garden. That’s a new thing.
And one thing that I am super proud of is the mosaic. It’s an antique mosaic. And actually, the mosaic people told me that nobody has made such a large-scale antique mosaic in the Czech Republic, or maybe even in this part of Europe, in the last 100 years. I need to double-check that, because I don’t want to be claiming something that is not true, but it’s amazing. And so we really want to make the place nice and beautiful for people and for us, and to feel like home, not like an office.
If you’re interested in visiting our garden and also our team, we will have an open house day. It will be on October 22. Please come, we are here in Prague. You can visit us, talk with us, and see what we are working on.
And one last example with AI game. In this example, we will actually see collaboration of multiple agents. So what is happening is that the player wants a sunflower from the agent. A sunflower. And he doesn’t have it. So he asks this merchant. He explains what he wants. But we also know that the merchant doesn’t have the sunflower. So he needs to obtain it somewhere. And what happens in the background between the agents, after we explained what we want — because now we are discussing this, yes, we want to buy it.
And so the merchant asks another agent to get a sunflower seed, which is this guy. So he brings the sunflower seed and gives it to the farmer. The farmer puts it into the ground. And again, this is controlled by the AI, not by us. So he puts it into the ground. Now he needs some water. So he decided that he needs to get water from the well.
And the sunflower grows in a second. And now he brings it back to the merchant. He throws it and he throws it to us. So that’s actually our unique way of throwing things. Originally we made it kind of by mistake — it was just some first version. Then we actually made the proper throwing like this. But then I decided no, no, no, that we need to keep this throwing because it’s just cool. So thank you very much for listening. And now we can go to questions.
I’d like to ask how you approach in your text-based game, how do you approach defining such abstract categories as love or memory. Do you refer to literature or philosophy for that?
Sorry, I didn’t hear the last part.
If you refer to literature or philosophy for that?
So for example, here, I would say that the term love wasn’t specified in the game at all. It’s something that is really provided only by the language model. So we don’t currently have anything that I would call a relationship framework or some kind of emotional framework. This is really something that happened only between the agents. What we have is only that there can be a friendship between the agents, and the friendship can then influence something else. In the training set for the language models, there were all the possible books that you can imagine, you know, so it’s not just technical texts. It’s also novels, books, stories, even movie scripts, probably even scripts from games, and so on. So yeah.
I have a question. If I got this right here, you run large language models. [inaudible] And second, if you’re trying to do something to consider a solution, vertical contouring via text or pictures, some kind of like, emotional essences like anger or something [inaudible].
So the first question, regarding the language model. We also played with GPT-3, because anybody can do it. But we are also now training our own language models, or actually I should correct that: we are using some open-source language models like GPT-Neo and fine-tuning them, because it’s more efficient. Then you can skip the phase of training the model. We can do only the fine-tuning, which is much cheaper. And the second question was whether the output can be something other than text.
I think it’s possible. Definitely, that’s one of the research projects that we want to do now. And that is something that I meant when I was talking about grounding the language model to the world or to the game world, where we would basically train the model so that the output is more relevant for the game world. There will not be things that are irrelevant or impossible in our game world. Like I said, for example, if the model predicts that now they are going to the river to start fishing, but there is no animation for fishing, you know, there is no functionality for that. We don’t want those kinds of things.
So, again, if we had a model that is trying to output something that is more relevant for the context of the game, it could be something like this. Another possibility is this: right now the output really is text, and then we parse it in some way. But if the output was really a mapping to exactly the game actions, that would be, let’s say, much more precise. I don’t know yet how this would solve the dialogues and those kinds of things. Also, I’m not sure if we need it, because the text is also easier to interpret, because you see, you know, what he said, and so on. And there are some other models, for example, called Socratic models, where you have multiple large language models. Some of them are image based. So they see the image, and some language, and they start to talk to each other. And the advantage is that it’s super interpretable, because you see what they’re talking about. So maybe we will not want to lose this benefit.
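To make the "output is text and then we parse it" option concrete, here is a minimal Python sketch. It is an illustration only, not the AI Game's parser: the `ACTION:` line format, the action names, and the `parse_action` function are all hypothetical. The point is that anything the game has no functionality for, such as fishing, gets rejected at parse time.

```python
from typing import Optional, Tuple

# Hypothetical set of actions the game actually implements.
SUPPORTED_ACTIONS = {"walk_to", "give", "plant", "water", "throw", "attack"}


def parse_action(model_output: str) -> Optional[Tuple[str, str]]:
    """Parse a line like 'ACTION: plant sunflower_seed' into (verb, target).

    Returns None when the line is not an action line, is malformed, or
    names a verb the game cannot execute (for example 'fish').
    """
    prefix = "ACTION:"
    line = model_output.strip()
    if not line.startswith(prefix):
        return None  # ordinary dialogue text, not an action
    parts = line[len(prefix):].split()
    if len(parts) != 2:
        return None  # expected exactly one verb and one target
    verb, target = parts[0].lower(), parts[1]
    if verb not in SUPPORTED_ACTIONS:
        return None  # no animation or functionality for this verb
    return verb, target
```

Because the model's raw text is kept and only filtered, this approach preserves the interpretability mentioned above: you can still read what the agent tried to do, even when the action is dropped.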
Thank you very much.
But maybe I will add one thing to this question. And this is actually quite the opposite. This is the input to the language model. So right now, the input to the language model is also text. So we need to parse the scene, somehow describe it, and send it to the language model. But the problem is that you don’t want to describe everything, because that can be super inefficient. You cannot say that there is this, I don’t know, red car in the shadow and some person is next to it.
And this person, I don’t know, he was friends with that person. That would be a little bit too much data. And again, the language models have limited context size, so you will run out of this context size very soon. So what would be more beneficial is if the language model was able to retrieve this additional information about the scene when needed. It would ask some database or the game engine, or maybe there is some screenshot of the game scene, and it would basically be retrieving this additional information, but only the context specific to that particular question or situation and so on. So I think working on the output is very important, but working on the input is also important.
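The retrieval idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `retrieve_context` function is hypothetical, plain word overlap stands in for whatever relevance measure a real system would use, and a word count stands in for the model's token-based context limit.

```python
from typing import List


def retrieve_context(question: str, facts: List[str], budget_words: int) -> List[str]:
    """Pick the facts that share the most words with the question,
    skipping any fact that would push the total word count over the budget."""
    query_words = set(question.lower().split())

    def overlap(fact: str) -> int:
        # Crude relevance score: number of distinct words shared with the question.
        return len(query_words & set(fact.lower().split()))

    # Most relevant first; sorted() is stable, so ties keep their original order.
    ranked = sorted(facts, key=overlap, reverse=True)

    chosen: List[str] = []
    used = 0
    for fact in ranked:
        cost = len(fact.split())
        if overlap(fact) == 0 or used + cost > budget_words:
            continue
        chosen.append(fact)
        used += cost
    return chosen
```

Given a list of scene facts and the question "who is the merchant friends with", only the facts that mention the merchant or the friendship would be sent to the model, and the rest of the scene description stays in the database until some later question needs it.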
What is behind your desire to have AI that can generate the whole thing? Just curious to see if you hate game development that much?
No, no, I actually like game development. But I think that’s the future, basically, you know, and you can fight it or just come along with it. So I think somebody will make AI-generated games. And another part of the reason, and this is why I’m so interested in AI, is that I think it’s a tool for automating the discovery of new things in the universe, for speeding it up.
Because we people can invent some things, but there are only seven or eight billion of us, you know, and it takes forever to invent everything. If you think about it, if you had trillions and trillions of these, not these language models, but more advanced language models, they could be inventing things in an automated manner. They could be improving themselves. Then this kind of civilization of AIs would be able to discover novel things, even novel forms of art, that we would never get to, because we would just not have enough time. We would not be able to compete with something like that.
So for me, the driver behind all this is that when I’m thinking about things, I have this kind of framework. I didn’t start my life with this framework; it’s more like I discovered it later. It’s the maximization of future options. So basically, you’re trying to get to a state where you’re maximizing your future options. You have choices. You don’t need to take all of them, and you also cannot take all of them, but at least you’re always getting to a state where you have more options, or more possible states that you can get to.
This also goes back to this novelty and uniqueness and so on, which interests me, because when AIs start generating games, there will be many more games generated, or much more art generated, than what people can do. So there will just be more novel things, you know, and sometimes with quantity, you get some quality. So in this, you know, vast amount of creative results, most will not be very interesting, but some of them will be. And overall, there will be many more of them than we can make as people now.
And I don’t know, for example, if I had this tool where I can type anything and it will generate a game for me, whether I would be using it or not. I’m not really sure. Because probably I like the process of developing the game. But on the other hand, I’m really curious what this will look like. What we are doing with the AI game is partially a business decision. But mainly, it’s just being curious about where we can push this kind of technology. And this is a really new, unique idea in game development.
Because there is only one example where I have seen something like this, and that is AI Dungeon. Probably many of you know it, but they are text based. So it’s a different situation. In ours, you know, there is also this visual world. So we have some limitations, but also some advantages. And again, I’m just curious where this can lead.
So just to build on what you’ve said about, you know, complete models that are more advanced, the first thing that comes to my mind, and I haven’t seen it mentioned anywhere, is the classic saying that if you put enough monkeys in front of typewriters, you’ll get one of them writing Shakespeare and whatnot. So what I’m curious about is this: imagine that we have millions of such advanced models, and we give them the entire Steam library to learn from. How many steps do you think it takes before one of those trillions of advanced models comes up with Doom? What do you think stops us from getting there?
So, I would split this into two parts. One is to be able to invent the concept of Doom, you know, like the game design part. The other is technical, to be able to program it. And because Doom already exists, so probably it will be in any kind of data set, you know that these models are trained on. So Doom will be easy, you know, in some sense.
Of course, today’s language models cannot do this. We have some language models that can produce little parts of code, you know, a few lines and so on. But they cannot produce the whole game; we are not there at all. And so, when this is solved, I think those trillions of language models, or models, will be generating not just Doom games, but any kind of game. That’s what’s interesting for me. And not just games, it will be harder. It will be medicine, it will be new business ideas, let’s say new ways of transport, new ways of getting energy, and so on. And going to space. That’s the ultimate goal.
Thank you for the interesting talk, I have two questions. One thing was whether you’re also working on something in secret that you’re going to surprise the world in one month or so. And the second one is I was surprised by the costs on one slide. And then like one other player, or one player who’s got $1. Is that gonna go down in the future? Or are these AI games — I feel like they’re gonna be really cost-prohibitive for many players.
So the first question, with the secret. I think I actually showed a lot of what we are working on with the AI game. Of course, we are also doing some other things, but I think this is really the best that I could show. And of course, what we have right now is never enough for me; I want more. I think it would be great if we achieved these things, like long-term memory, bigger context size, and so on.
We are working on those. If we succeed, that will be great, but it’s not guaranteed, you know. It can take months or years, maybe somebody else will do it before us, and so on. And then the second question, with the cost. Of course, I think $1 is super prohibitive. And I’m not even sure if people will want to pay that and so on. So right now we are actually working on bringing the price down. One idea is that you can try a smaller language model and just fine-tune it better for your use case. That’s one way this is usually done in this situation.
And a smaller language model means it needs less memory, so you need fewer GPUs, and this reduces the price. Then maybe we can also play with some other optimizations, like getting not cheap hardware, but hardware from a good provider, you know, maybe not overpaying for it in AWS, and so on. So that can be another thing. And then we’ll just have to see. I really don’t know yet what the business model for this game will be, whether we want to be charging some subscription, or per minute or per hour, or just a flat price and hope, you know, that we will not go bankrupt.
I really don’t know; this is something we have to figure out. And, but there are also different kinds of trouble, which is that we’ll probably start using the language models or query the language models even more frequently than we are doing right now. Because right now, we mostly query them during the dialogues. And sometimes also, like when there are no dialogues. But what we want to do is that we want to have fully language model driven agents. And in that case, we will have to run the language model, let’s say, every second or every five seconds, so the cost will be even higher.
So maybe if we do it with the price that we have right now, with this price estimation we have right now, maybe it will be even $10 per minute or per hour, sorry. So yeah, we need to reduce the price. And I still think that this is actually something where you need to think about the — when you’re thinking as a businessman, you need to think about, what is the cost? And what is the benefit for users for your AI. And I think using AI where the inference is very expensive, and you need to do it frequently, it’s probably not the right — the best way right now.
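The scaling described here is simple arithmetic: cost per player-hour is the number of queries per hour times the cost of one query. A small Python sketch makes it explicit. The per-query price used in the usage note is invented for illustration and is not GoodAI's actual figure.

```python
def cost_per_player_hour(seconds_between_queries: float, cost_per_query_usd: float) -> float:
    """Inference cost for one player-hour when the model is queried at a fixed interval."""
    if seconds_between_queries <= 0:
        raise ValueError("seconds_between_queries must be positive")
    queries_per_hour = 3600.0 / seconds_between_queries
    return queries_per_hour * cost_per_query_usd
```

With an assumed price of $0.002 per query, querying every five seconds gives 720 queries and roughly $1.44 per player-hour, while querying every second gives 3,600 queries and roughly $7.20. Querying five times as often costs five times as much, which is why fully language-model-driven agents push the price up unless the per-query cost drops.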
So that’s why I think these generative AIs, these image generators, are actually a good choice. Because you don’t need to generate thousands of images in a minute, you know. You can just select some image, and then you change the prompt, generate another image, and so on. Whereas if you had a game where every frame, 60 times per second, was generated completely by a language model, then first, we still don’t have hardware that can do this, and second, it would be super expensive. So that business model, I think, is not the right fit for the current situation.
Maybe you really do need to wait 10 years or so until you have a game that is completely rendered and calculated by AI, you know, in real time, 60 times per second. Until then, we’re talking more about using AI just to generate the description of the game, the code, and those things, and then letting standard programming approaches run it. Let’s say you have an AI that generates some description for Unreal, and then Unreal runs the game. Not your AI, but Unreal, so you don’t need to pay the cost there anymore.
Whereas in the future, I think also there will be games that are fully calculated or emulated on AI in real time, and basically, there will not be some intermediate layer. There will not be a game engine, nothing. It will be just a neural network, you know, that generates the game frames, sounds, music, and so on, based on your input 60 times per second. But I think that will be super, super expensive.
Yeah. So I actually hoped that this would be possible. Because I mentioned that stable diffusion, they were able to downscale their image generator to eight or nine gigabytes. Previously, it was much larger. And they think, because they are really — sorry?
Yeah, and so for example, they are hoping that they can downscale their model from 8 gigabytes to 100 megabytes, which is really ambitious. But if this direction proves possible, then maybe we can also apply it to our situation. And the models that currently have, let’s say, 20 gigabytes, we will be able to downscale to maybe 1 gigabyte or something like that. But that’s something that I don’t want to promise, and cannot even believe in myself. You know, it’s certainly something that needs to be tested. And I think it will be quite hard.
What I think is interesting from the scientific perspective is whether it is possible to have only 100 megabytes of neural network parameters, of weights, that can generate these rich responses, you know, that can understand and describe the world. That sounds kind of strange to me, because, you know, it’s just not that much data. And if you can reduce the personality to 100 megabytes, that could be quite interesting. So I’m really not sure. But it would be good to downscale it to the client GPU, because it would reduce the cost for us and probably change this business model to the standard gaming business model, where you sell a game and you don’t care about the hardware. Basically, you’re outsourcing the hardware processing to your customer, which in this case we cannot do yet.
Okay, so let’s go back to the future again and imagine you’re on a Steam library full of really intense games created by putting the hands of AIs? Who’s gonna choose what’s good and what’s not? I have a problem, right now choosing what’s good.
Yeah, I think in a very similar way to how we now have recommendation algorithms, you know, on YouTube or TikTok or Instagram, or in some e-shops. They learn, based on your preferences and the preferences of other people, what you like or do not like. So they will offer some game that they think you will like. Maybe even in this mood or in this situation, they will offer this game to you. And if you don’t like it, you know, they will learn from this and maybe offer some other games.
So actually, I think that the next TikTok could be this kind of AI-generated platform where the content is generated not by people. On TikTok, you know, people upload the videos, except that there are already some channels on TikTok that are generated by AI. Some kind of fake accounts, I would say. So it’s possible, but the quality is questionable. But imagine that there is AI that can learn to produce a game that is the right fit for you, that knows you better than anybody else. So in that case, I think you will not see thousands or millions of games. You will see only the games recommended by the system. Thank you.
Transcribed by https://otter.ai