Slaying the ML Dragon

Steven D Marlow
Jan 13, 2022

I’d like to tell you a story of high adventure

Actually, I’d like to play a game of high adventure where you just need to stay on-script most of the time, because the goal is to capture the thought process that goes into creating real-time generated narrative fiction for what would be an 80-hour campaign. Really, though, it’s about playing the role of a large language model that has been trained on a dataset of annotations from 30 million words’ worth of campaigns, so that story and plot mechanics can be queried, allowing humans to take a peek inside the black box.

It would be easy enough to train on hundreds of books about story mechanics, plot points, character development, and so on, but the rules of the MIRI Visible Thoughts Project say you have to generate these 1,000-action campaigns in real time, as a text-based dungeon crawl adventure, and capture the writer’s thoughts about narrative and world building as they happen. The writer is of course the DM, or Dungeon Master. And they are the player. And the NPCs. And the person documenting the writing process (or you could work with someone else and take turns, because having a single narrative voice doesn’t matter). Couldn’t you just read through a short story and pick apart all of the unwritten details, the subtext? Nope. According to the organizers, that would be harder to do, and would just amount to picking things out after the fact.

By their own estimation, such a campaign would be about 300,000 words, which is why they want 100 of them (to hit the 30-million-word mark for training data). Seems obvious no one person is going to do all of that! What does their writer’s guide say are the best methods for having a group of writers achieve that? Guide? No no no. They are looking for people who can organize and manage a group of writers, and presumably, provide any of that stuff to them directly (for some cut of the prize money, which is another detail they will have to work out). Oh, and to qualify as one of these managers, you have to show you really understand what they are asking for by commenting on a 3-page sample, and providing the tiniest of writing examples. *And for some reason, some of those examples are supposed to show how not to do it correctly?

Of course, you don’t start handing out financial advances to people who only wrote 2 or 3 action steps and expect them to recruit “the actual writers,” because wouldn’t those writers have to pass the same kind of test? Well, from what MIRI has posted to date, sort of, yes. Oh, wait, I left out a step. All of that is just the main challenge for a one-million-dollar prize. There is a smaller prize pool of $200k for the first 10 people who can submit one of those 1,000-step campaigns. That means each is worth $20k. Just don’t do the math on the larger challenge, because those same-sized runs are only worth $10k each.

Still, $20k for what seems like a fun writing assignment? It generated a lot of interest, and wasn’t limited to people in the ML community who also had some D&D role-playing experience under their belt. The 300,000-word count still seemed daunting, though. A popular suggestion was a 100-step campaign: a tenth the size for a tenth the prize money. Talk about low-hanging fruit. Too low, in fact. There was a lot of back and forth about that not being enough for a full adventure, but if it could be done, would they still accept them? Sure, if it really seemed to work, maybe they would accept a few.

Is this leading somewhere?

I’m glad you asked it that way, because yes, as the story naturally progresses, we start to see the bigger picture. The number of steps, or actions taken by the player, doesn’t correlate with the quality or variety of data. This crack in the armor opens the door to a more realistic writing process, and a more manageable game. From a scalability standpoint, a single quest of 300 actions, done in 7 or 8 sessions (which simulates about 4 hours of actual play), would provide a complete story arc with all of the elements needed to generate a wide sample of “reasonings.” It even works out for MIRI, as these quests can be worth $6k each: three of them cover roughly the step and word count they are aiming for at $18k, instead of $20k.

Each of those sessions can be broken down further into scenes. Meeting the blacksmith can be done as a single scene, with only that part of the world to worry about. As you create more scenes, you start to look ahead a bit, thinking about details not directly related, so of course you have to capture those thoughts as notes about potential future events. Some of the places and items in your story are also going to need some backstory of their own, as if they were characters in their own right; traditionally, this would be called lore. Events from one scene link to events from another, and this starts driving your narrative, giving the player’s character something to “play off of.” You will have to keep track of a lot of this “off document.”

Not that I need to tell you any of this, because that short sample you submitted, with a very specific and very rigid (did we tell you how rigid we are on that? well, we will, about 6 weeks too late) thought-prompt, thought-action, thought-outcome format, should have been enough to figure all of these details out. Or, we could have put them in a guide or something. /sarcasm

To be fair, many of the different types of information being captured are at least mentioned when describing what should go where, and how it should be formatted, though it feels a bit arbitrary.

What information did they provide?

I’m going to preface this by saying bullet points suck, and because the “tagged” text will just be a string of characters to a training system, they have zero advantage over literally any other method of writing. I will not be replicating the exact style of formatting from their examples.

The important section of their post, which I read over many times, goes into detail about what the different “tags” should be capturing. Thoughts is the first, and is meant to represent all of the things about the story at that moment, including reasoning about events in the fictional world. But right away, it says this: “such as summaries of what just happened.” Directly from their updated FAQ, we get “we’re looking for the thoughts that lead to a prompt, not a post-hoc analysis of prompts.” The prompt is what just happened in the story. Circular confusion aside, thoughts are the in-game details that have not yet been expressly conveyed to the player. The world-building stuff.

Thoughts, as I described in my “application,” is an overloaded tag, because they also want the writer to represent the reasoning about the story at the big-picture level. Things like plot tension, or mysteries that were just introduced. To show the two different concepts, story reasoning must have (parentheses) around it. An added step in parsing, but critically, much harder to keep track of when you need to scroll back 3 or 4 pages to find something.

*Conceptually, these seem like the same thing. From their examples, it looks like they are going for in-world common sense vs. story narrative and structure. That they may be looking for common sense at any level is a challenge in itself, and not at all what they initially claimed to be after.

Highlighting just how over-used the thoughts tag is, they also want any non-immediate information about the world, such as details about a character or important items. Things that are not relevant to the current scene, but might come into play later on. For this, you are told to use [square brackets].

Did you think that was enough? Optionally, they ask for meta cognition that steps outside of the world, outside of the larger story, and looks at how the actual writing process is going. {Braces} for those.
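Since all four kinds of detail live inside one Thoughts block, distinguished only by bracket style, it’s worth noting how little the machine cares. A minimal sketch in Python, with my own category names standing in for MIRI’s, and assuming (for simplicity) one detail per line:

import re

# Classify each line of a Thoughts block by its bracket style.
# Category names are my own shorthand, not MIRI's terminology.
PATTERNS = [
    (re.compile(r"^\((.*)\)$"), "story"),       # (story-level reasoning)
    (re.compile(r"^\[(.*)\]$"), "background"),  # [non-immediate detail]
    (re.compile(r"^\{(.*)\}$"), "meta"),        # {writing-process notes}
]

def classify(thought_block):
    tagged = []
    for line in thought_block.strip().splitlines():
        line = line.strip()
        for pattern, category in PATTERNS:
            match = pattern.match(line)
            if match:
                tagged.append((category, match.group(1)))
                break
        else:
            if line:
                tagged.append(("world", line))  # bare text = in-world thought
    return tagged

print(classify("Prota is curious.\n(Move the plot along.)\n[Lior is a curious person.]\n{This scene is dragging.}"))

The point being: the writer pays the full cognitive cost of the brackets, while a training pipeline would shrug at literally any other consistent markup.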

The next item is the prompt, which is the writer’s “output” into the story world. This would be the block of text that a player reads; the stuff a language model would generate. While all of the sub-items above are shown as bullet points under thoughts, prompt is just its own thing. They will even abbreviate them as T and P (insert your own joke here). Confusingly, they also use output as a secondary prompt. They expect writers to think about the scene, write some output text for the player, have another thought about that, write what the player might say, have a thought about that, then write another output. It’s not wrong to assume that “output” is always going to be a reflection of “input” despite not having an actual player, but it’s never explained by the organizers. Samples are hard to follow because of this, not to mention the naming-convention issue, but more on that in a bit.

Action is the player’s input. I went the extra step of making a distinction between the player talking and the player doing something, because there is no real system on the other end sorting inputs by intent. There was no way I was going to get through their 30-page example, so I can’t say how consistent they were, but from their own short example you can see what it looks like when you don’t add quotes when a player speaks.

Thoughts: <none>

Player action: Good to see you too, Lieutenant. Now take me to see the body.

Thoughts: The player is talking to the Lieutenant.

Outcome: You say, “Good to see you Lieutenant. Now take me to the body.”

This feels like a mess of redundancy; just capturing the player action in quotes would allow the writer to move to the next part of the scene. It is the writer, after all, who is playing the player. It’s a nice example of what current ML systems are really bad at doing, but if that was all they were trying to address, why not just do this for existing bodies of text? For the next example, remember that the first line is what someone typed, and the last line is what the model would spit back out.

Player action: I finish smoking the cigarette and take a shower.

Thoughts: The player finishes their cigarette then takes a shower.

Outcome: You finish your cigarette and take a shower.

That is some grade-A rejection material if I ever saw it, but it’s their own example of what they are looking for. They add: “It’s unclear to us how much skill is required to produce this dataset. The authors likely need to be reasonably introspective about their own writing process, and willing to try things and make changes in response to initial feedback from the project leader and/or from MIRI.” Question is, will they be willing to make changes based on feedback from methods that promise to scale?

The writing assignment.

I wasn’t sure if I was going to include this at first because it’s not really about ML training data and the different ways the project seems to have stumbled out of the gate. However, they chose the writing theme of a text-based dungeon crawl adventure, presumably because of the popularity of AI Dungeon and its use of GPT-2/GPT-3.

Without a full grasp of their longer example, I can’t say for sure how the story unfolds, but there is a big difference between creating a dynamic, player-driven story and just writing a story where the player’s actions are not even considered as external to the process. I once again go back to the idea of a guide that helps explain the different aspects involved, not just in writing and formatting, but in the elements required to run an adventure. How often should the player do or say something that is inconvenient or even dangerous to the narrative (while still being in character)? How should combat work? How is the placement of treasures, traps, and enemies supposed to be documented (the mapping problem; because even if it’s text-based, the DM is going to need a visual reference as part of the process)?

The point driven home by the organizers was that this had to be something new, something being created in real time, because that’s the only way to capture the essence of reasoning. I suspect the examples they gave started with some big-picture view, and that the writing was “easy” because they just had to connect the dots and fill in the details. And rather than catching the thoughts of someone making up details as they went along, they were able to focus on the unspoken, common-sense aspects that shape the thoughts that go into the thoughts.

This would be the “stop what you’re doing to be introspective about everything you know and have experienced in your life that led you to write that last line” part. Asking people to do the hardest thing in the hardest way in the hardest format, and then having the nerve to be disappointed that everyone can’t do “such a simple thing.”

Fictional in name only.

In a fantasy setting, with non-human characters, what defines common sense? For their 3-page sample, they can use a common trope about the burned-out detective, with an unkempt apartment, a not-so-friendly relationship with the police chief, and noir-style lighting of a cigarette in every other scene. It’s relatable information for DM and player alike. It’s just as common in movies and shows, where aliens and robots are not written as alien or robotic, because they’re written by humans, for humans.

If you really do just focus on story, then you can write events that are at least believable within that world. There would be no need to understand why an Orc is having a bad day, dealing with whatever social issues Orcs normally deal with when the player isn’t around. The important point here is that a story can unfold as a sandbox, with all of the rules of the world being self-contained. I’d suggest that’s the whole point of escapism. As for training data, a lot of the real-world stuff that goes unspoken would have to be captured externally to the story (any subset would be of limited use, and it goes back to just getting all of this from books on story structure, tropes, etc).

Parsing and context switching.

From their main page: “Here’s an example of the first couple of steps of a thought-annotated dungeon run (or “quest”), in the format MIRI currently thinks is worth trying.”

That statement is bullshit. Here is what I was told by email after making my submission: “We are in fact sticklers about the thought/prompt/thought/action/thought/outcome format, and the various styles of brackets.”

I was asked (in said email) if I had seen the FAQ. I had, in fact, seen it last December, but when the email told me I should look at it to “help clear a few things up,” I went to make sure it was just as I remembered. Turns out it got a MAJOR update on Jan 7th; my submission went in on Jan 9th. Notes on formatting are right there at the top, followed by naming convention (a detail missing from all their other posts).

Within a single thought, with those darn bullet points, you get a lot of different details that are identified by not-so-easy-to-differentiate markers.

Thought:

World building.

(Story building.)

[Character building.]

{Building building.}

They are cleanly separated in the examples, but they don’t come out of the writer’s head like that. And as I intentionally look for it in the example, yes, there are a lot of “common sense” things being written, and a lot of explicit narrative. I’m not saying it’s wrong to dumb the details down so even an algorithm can (try to) understand them, but these are things that could be pulled from already-written stories!

Naming convention is something else that will just pull a writer out of whatever thought they were having. The player must be referred to as “the player,” but their character must be referred to as “Prota.” Not in the game, though. In the game, the player’s character will have their own unique name. I guess that would format as: character’s name, (the player), [Prota], and… {actual name of the person playing the role of the player in real life, if there is one}. How is that not confusing as shit?

They claim to be interested in the thought process behind story generation, but when people show their work, they complain. They don’t actually want self-reflection on “the process.” They want people to make notes about the prompts and the actions, notes that reflect what current ML systems absolutely fail to understand. Worse, they’ve said the dataset isn’t even targeting current models.

Do as we say, not as we do.

From their example:

Thoughts:

Prota is curious about why Morven is shouting. Prota is also worried that the situation is dangerous because Morven is a fire mage and Lior said that they might set the academy on fire.

(The plot will move forwards faster if they go see what’s happening. Nothing much is happening right now and it’s good if the plot moves to look at this new event.)

[Lior is a curious person] Lior also wants to see what’s going on.

Prompt: “I think we’d better,” Lior says. “He’s just a first-year student like us, I don’t think it’ll be too dangerous, and someone needs to do something if he’s about to set the school on fire!” Lior gets up and starts jogging toward the back doors of the dining hall, taking a shortcut through the kitchens.

Thoughts:

[Morven is male.]

The other people in the dining hall aren’t doing anything and are probably not very proactive.

Lior knows the school layout and that the fastest way to reach Morven is to go through the kitchens. Lior is in a hurry to see why Morven is shouting.

In the segment above, why not just say: Lior is a curious person and wants to see what is going on? The writer’s comment about trying to get past this point and move the plot forward should be removed from the world-building section. The thoughts after the prompt seem a lot like the “mentally late” examples they have been very critical of. Details in the prompt are being explained after the prompt was written, which violates their own rule. I’m just going by their examples, but with a different format, what I wrote seems no worse.

From my session:

[ PROMPT ] You have enough money for a new sword, and for the winter cloak you are going to need so as not to freeze. Standing in front of the blacksmith’s barn, there is a sign above the door that simply reads “for those of worth.”

[ LOGIC ] The player needs to be pushed into thinking about a winter cloak. The sign above the door is a clue about needing to prove themselves to the blacksmith.

[ ACTION ] I go inside to buy a sword.

[ THOUGHT ] The interior of the barn is smoke-filled but warm. There are several horses in stalls along the far wall. It looks like fresh hay has been placed on the floor. No one has walked on it yet. The blacksmith has his back to the door. There are swords all around the work area, some finished, some still in progress.

[ PROMPT ] Your eyes need time to adjust as you go from the bright snowy exterior to this dark and smoky interior. You can see the faint outline of horses in the stalls along the far wall. The glint of bright flames reflects off a newly completed sword placed on a table at the far end of the barn. As you start to walk toward it, the blacksmith, with his back towards you, says “that’s fresh hay.” You stop and glance over at the horses again. Then he says “I hope you’re not tracking mud into my shop.” You look down to see that you have.

[ LOGIC ] Attention is drawn to a finished sword, but it’s not the one the player should get. Getting the hay dirty creates a conflict with someone the player needs to do business with. Tracking mud over fresh hay already creates tension.

[ ACTION ] “I will replace it for you if you sell me a new sword.”

// Here we see our first in-game quote.

Of course, I’m biased to say my version is easier to read, faster to write, and flows naturally, because you’re either writing the text of the story for the player or you’re stepping into the world to write “what you see.” The use of Logic as a tag covers the post-prompt details that ONLY came to light as the prompt was being written. The comment at the end uses a very simple //, the most widely understood way to comment out a line of text, which is the point of {meta cognition} being external to the document.

Stream of Consciousness.

The first issue I came across when writing out a full session is that my thoughts about activity within the world were basically static until I wrote the prompt. I had no awareness of what I was going to write until I was actually writing it, and almost every time, I was expressing “game mechanics.” The act of writing a prompt based on static thoughts led to in-game motion that had to be captured post-prompt as logic. This “logic” is the story stuff MIRI was asking for, just under a different tag, and at the correct time: when the DM, the writer, came to be aware of it. And at the top of my “submission” I even say that it doesn’t matter from a training standpoint that the logic tag comes after the prompt, because it will just go in as pre-prompt data. It takes human intuition and flips it into something more methodical, more “training friendly.”
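To make that concrete, here is a minimal sketch of that shuffle, using my [ TAG ] markup rather than MIRI’s format, and assuming (for simplicity) that each block fits on one line:

import re

# Split a session into (tag, text) blocks. Tags look like "[ PROMPT ]".
BLOCK = re.compile(r"^\[\s*(\w+)\s*\]\s*(.*)", re.MULTILINE)

def parse_session(text):
    return [(m.group(1), m.group(2).strip()) for m in BLOCK.finditer(text)]

def to_pre_prompt_order(blocks):
    # Move each LOGIC block in front of the PROMPT it annotates, so the
    # training sequence reads thought -> prompt, the order MIRI wants.
    out = []
    for tag, text in blocks:
        if tag == "LOGIC" and out and out[-1][0] == "PROMPT":
            out.insert(len(out) - 1, (tag, text))
        else:
            out.append((tag, text))
    return out

session = """
[ PROMPT ] You have enough money for a new sword.
[ LOGIC ] The player needs a push toward the blacksmith.
[ ACTION ] I go inside to buy a sword.
"""
for tag, text in to_pre_prompt_order(parse_session(session)):
    print(tag, "->", text)

The writer captures the logic when it actually surfaces; the pipeline puts it wherever the training format needs it.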

The PLAT Method.

This was something I kept on a notecard on my desk, under the monitor, but there were plenty of times when something could be skipped or even repeated. There is a PROMPT, which represents the AI-generated story content. It can set the scene, but it also found some use as a narrative tool when breaking up the output, so there would be a block of text to read, and then BAM!, something else just happened, rendered as a new block. And given the single-player nature, where the writer is the player, it was sometimes convenient to skip the player’s action step and just go with the “default” outcome. I know that actually hurts the all-important step count, but doing so is probably why the session was able to get over 300 words per step, which is what is really supposed to be the goal.

As the Prompt is going to be generated by any writer at an intuitive level, in the same way someone might compose a song or paint a mural, the “directives” need to be captured, so LOGIC will always follow Prompt (though narrative or action may happen in a short loop where the Prompts are still drawing from other Thoughts, so it might be skipped). To put it another way: if Thought is a still image of the scene, then Logic is a video clip. Thought tends to be about worldbuilding while Logic tends to be about story, so it covers the kind of details MIRI was publicly asking for.

The ACTION step is just the player input, and I’ve already covered how I don’t just repeat the input in the output, and that the use of quotes denotes the player speaking in-game vs. taking an action (this is more of that “in the world” activity and story-level action being kept separate). I’ve only used Action a dozen times or fewer in a scene, because flow is more important than formality. Having an action step where the player just continues walking up the path, because that’s all that needs to happen right now and is all the writer/DM is going to allow, is pointless.
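That quoting convention is mechanical enough to check automatically. A tiny sketch (my convention, not anything in MIRI’s spec):

def classify_input(player_input):
    # Quoted input is in-game speech; bare input is a physical action.
    s = player_input.strip()
    if len(s) > 1 and s.startswith('"') and s.endswith('"'):
        return ("speech", s[1:-1])
    return ("act", s)

print(classify_input('"I will replace it for you."'))  # ('speech', ...)
print(classify_input('I go inside to buy a sword.'))   # ('act', ...)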

THOUGHT is a bit like a story inside a story. With a few exceptions, I never really get inside the player’s mind. When writing a Thought, I feel like I’m taking the player’s place in the adventure. Once that clicked with me, writing became the easiest part of the challenge. But as the session progressed, and the world started to flesh itself out, there were “thoughts” that didn’t belong in the scene, or even in the next two scenes. I was going to use FUTURE as a new tag, and if I were to continue writing the quest, I might change to that. NOTES is the conventional term for look-ahead story and plot points, so I went with that.

A final tag, created near the end of the session and retroactively applied to some of the comments where I had been leaving such details, was LORE. There is a temple and a sword that have some larger in-world backstory, even if I had no idea what that would be when I started writing. I also used it to describe the player as a Paladin, and to build out his backstory.

When you include the forward slashes, everything MIRI wanted to tag/identify is covered. Just trying to write some of their examples out for this post is slow and complicated, compared with just starting a block of text with its tag and then writing it out. And it is SO MUCH EASIER to jump back to another scene, scroll a page or two, and instantly find the Note or Thought you needed. Marking inventory items in bold was also done as a visual aid, and I’m not even sure the organizers considered this gaming aspect.
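For anyone wondering whether the two schemes even line up, here is my reading of the correspondence as a simple lookup table (the mapping is mine, not anything MIRI has published):

# How my tags line up with MIRI's markers, as I read their spec.
PLAT_TO_MIRI = {
    "PROMPT":  "prompt",
    "ACTION":  "action",
    "THOUGHT": "thought, bare text (in-world, worldbuilding)",
    "LOGIC":   "thought, (parentheses) story reasoning, captured post-prompt",
    "NOTES":   "thought, [square brackets], look-ahead detail",
    "LORE":    "thought, [square brackets], backstory for places and items",
    "//":      "thought, {braces}, meta comment on the writing process",
}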

The real work.

I haven’t even touched the elephant yet. Their examples, as some of the keen-eyed may have noticed, are not set in a dungeon crawl universe. The reason, I suspect, is that they went with settings where they already had enough knowledge, enough reference material, to base a story in them without having to do any extra work. Probably half of my time was spent doing research, trying to find just the right details to make the world sound “authentic.” And they have mentioned, more than once, that they’d like to avoid copyright issues, so everything needs to be original.

A single-player adventure places a limit on combat and the types of creatures and spells the player can deal with. Even looking at a trilogy of quests with the same protagonist, they might only progress a few levels up in strength and ability. It also requires a rethink because it’s going to be a non-dice adventure (the player is basically immortal, but it’s easy to write them into a situation a real player would not get out of).

Despite the desire to use an annotated dungeon run model, there was no mention of character types. Was the player to be human for all runs? They did say that adventures with more than one player would be an issue (that’s the problem with having the Prota token for training), but couldn’t the player’s character have an NPC in a supporting role?

Reference material is required, even if you are just going to do a 300-step campaign. There is just too much time wasted hunting for that one bit of detail. If you come across a baker, and want to say something about the bread he is making (you, as the writer, looking at the scene, before having some idea of what information to give the player), a few minutes of looking around would lead you to say he was making flat plates of unleavened bread (lack of yeast; the kind of bread poor and common folk would have to make do with). Is that useful for the dataset? Unlikely, but it’s part of the thought process that comes from writing an adventure one action at a time, which is how the challenge was framed.

In my “application” document, which is really way more than they asked for, I tried to explain a lot of these ideas as part of my general understanding. The full text of that can be found here.

Day 1 Problem.

With such a high focus on “introspection,” the organizers should have shown a great deal more by way of planning and support. By not trying to run a session in an actual fantasy setting, they have been left unaware of the challenges and issues people would be facing. They had a very ‘current year’ machine learning focus on what they wanted to capture. The understanding required to generate coherent runs is actually a much deeper problem, and as a result, they reject one aspect to focus solely on another.

In their postings they make frequent reference to finding people to manage writing groups, or to having people who can demonstrate a method that scales. Creating a writer’s guide and reference material became the job of writers who were willing to be proactive. Did they expect that a group that put time and effort into making dataset creation easier would also share that with the rest of the community? Collaboration was encouraged, just not supported (unless you could submit a 2 or 3 step example they liked, and then the doors would open for you, but only at the expense of shutting others out).

Many of the comments on their original post, and in the Discord, showed that people were worried about the $200k in prize money for the 10 runs being “spoken for” by people who may never follow through, and that some advances would be wasted. This is another reason the 100-run idea caught on: it meant more chances of winning, and less risk of being nearly complete when MIRI announces the prize pool is now empty.

There has only been a whisper from the organizers about not really getting the kinds of details they wanted from submissions, and that was from week one. The “community” hasn’t posted any of their hits or misses. There have been no posts by the organizers saying “Hey, look at this great submission we got!” Will anyone manage to get paid for even a 100-step run? Is everyone doing it wrong, or did the organizers just fail to match what they wanted with what they were asking for? Or, as I think is the case, they didn’t have a clear idea of what they really wanted, came up with a method and format that doesn’t work, and expectations will never be satisfied.

Final thoughts.

The format MIRI wants others to use just won’t scale, and while I think the content I created can be squeezed to fit, it’s going to be an ugly, complicated process. There is not even a guarantee they would accept it because of the drama over what they want, what they say they want, and what they themselves produced.

There are three elements at play. The first is improving dynamic story generation, where the model is able to manage plot lines and narrative coherence. I suspect most of the sample runs people are providing fit into this group, where the writer’s attention is drawn in relation to what they are showing the player. Placing a simple GOFAI in front of an LLM, one that can prompt it multiple times until it gets a “best fit” response, would be the easy way to get a 10x improvement. *A GOFAI forward model?
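As a sketch of what I mean, here is a plain best-of-n loop; generate and plot_coherence are hypothetical stand-ins for the LLM call and the GOFAI scorer:

import random

def generate(prompt):
    # Hypothetical stand-in for an LLM call; returns one candidate.
    return random.choice([
        "The blacksmith turns to face you.",
        "A dragon appears out of nowhere.",
        "You wake up; it was all a dream.",
    ])

def plot_coherence(candidate, plot_state):
    # Hypothetical stand-in for the symbolic scorer: reward mentions
    # of entities the front-end is currently tracking.
    return sum(e in candidate.lower() for e in plot_state["entities"])

def best_fit(prompt, plot_state, n=8):
    candidates = [generate(prompt) for _ in range(n)]
    return max(candidates, key=lambda c: plot_coherence(c, plot_state))

state = {"entities": ["blacksmith", "sword"]}
print(best_fit("The player enters the barn.", state))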

The second idea is the one the organizers have been explicit about, which is having an additional layer in the dataset that associates what is being communicated with the “reason” for doing so. It is a shallow way to “query” the black box, even if you’re just training two models side by side and linking them with an index. You can imagine a story prompt with contextual highlights when you roll the cursor over different sections. Their format doesn’t seem to work with that idea, as the action and thought are too far removed (better to have a direct pairing, such as “You light the torch” = It’s dark in the cave).

*Box A takes “You light the torch” and outputs a hash of the full statement while Box B takes “It’s dark in the cave” and outputs the same hash value… Is that double ended? Back-to-back? Mirrored propagation? Quantum Entangled Backpropagation copyright trademark, etc.
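Taking my own footnote half-seriously, the nearest real technique is a dual encoder trained to pull matched pairs together. A toy sketch in PyTorch (the encoder and the single training pair are mine; nothing here is MIRI’s proposal):

import torch
import torch.nn as nn

class BagEncoder(nn.Module):
    # Toy text encoder: the mean of learned word embeddings.
    def __init__(self, vocab, dim=32):
        super().__init__()
        self.vocab = {w: i for i, w in enumerate(vocab)}
        self.emb = nn.Embedding(len(vocab), dim)

    def forward(self, text):
        ids = torch.tensor([self.vocab[w] for w in text.lower().split()])
        return self.emb(ids).mean(dim=0)

vocab = "you light the torch it's dark in cave".split()
box_a, box_b = BagEncoder(vocab), BagEncoder(vocab)
opt = torch.optim.Adam(
    list(box_a.parameters()) + list(box_b.parameters()), lr=0.1)

# Pull the outcome and its reason toward the same point in vector space.
for _ in range(100):
    loss = 1 - torch.cosine_similarity(
        box_a("you light the torch"), box_b("it's dark in the cave"), dim=0)
    opt.zero_grad()
    loss.backward()
    opt.step()

A real version would also need negative pairs, pushing unmatched outcome/reason combinations apart, but the “same output for paired inputs” intuition is exactly this.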

The third element is laughable: from a dataset of 30 million words, train an AGI. It’s actually worse than that, since you have story text plus common-sense adages, making the count closer to a million samples. Samples taken from a fictional world that is inherently unrelatable to everyday life. It’s this disconnect between what they think they want and what they are actually asking for that derails the entire effort. I had zero faith in the ability of such training data to actually work in creating a better model, but the focus was only on generating the data, which sounded easy enough. I’m updating my assessment to less than zero chance, should this turn out to be the actual end point.

I’m posting a link to my full session for anyone to read. Obviously, since MIRI hasn’t paid me anything, the story and character IP are still mine. Legally, I don’t think they could accept my session in their format if it were submitted by someone else, because they know that person doesn’t have the rights to it. As I’m placing it on the internets, this does allow them to use parts of it as an example of what they don’t want, and to make a critical post about it.

As for this post, it was a cathartic release of the frustration that comes from watching members of the AI community who seem to lack understanding of “the full stack” of the AI problem, despite decades of historical observations.
