AI Throws Its Own Party — The Society Smallville Showed Us, and the Dream of Ultima Online

Have you ever played The Sims as a kid and imagined that the characters truly had their own will and were living out real lives? Or maybe you've daydreamed of a future like Westworld, where AI forms a society so refined it's indistinguishable from human civilization.

In 2023, researchers at Stanford and Google made exactly that imagination real. Their experiment, Generative Agents: Interactive Simulacra of Human Behavior, placed 25 AI agents in a virtual village called Smallville. These agents weren't puppets moving along developer-scripted rails: each had its own identity and memory, made its own judgments, formed relationships with others, and even spontaneously planned a party.

One word kept surfacing as I read. Emergence. A society forming with no one telling it to. This is not just another chatbot demo.

A Single Seed, a Valentine's Day Party

The most dramatic moment in the paper was the Valentine's Day party.

The researchers planted just one piece of information in a single agent, Isabella — that she wanted to throw a Valentine's Day party. One seed. That was all. What followed was something even the researchers didn't anticipate.

Isabella began spreading the news to the other villagers. Word traveled through conversations between agents. Some invited others; some asked each other out on dates so they could go together. That single piece of information, starting from one person, reached 12 agents — nearly half the village — and on the day of the party, 5 of them showed up at the café at the agreed time and enjoyed it.

Here's an exchange they had:

Maria: "I'm planning to go to Isabella's Valentine's Day party — do you want to come with me?"
Klaus: "That sounds great! Thanks for inviting me. Do you know what time we're meeting?"

Inviting, coordinating times, suggesting going together. This is not programmed behavior. This is social coordination.

It's also oddly realistic that 12 were invited and only 5 came. Of the remaining 7, three cited scheduling conflicts, and four showed interest but never actually made plans. Not so different from a human party.

Remember, Retrieve, Reflect

What made this behavior possible was the architecture. The critical limitation of conventional LLMs is the context window: as a conversation grows longer, earlier content slides out of the window and is forgotten. The researchers solved this with four mechanisms that mimic human cognition.

The Memory Stream is a database that records every experience an agent has, in natural language. Observations like "drank coffee at the café at 8am" and "talked with Maria about the election" accumulate in chronological order.
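As a data structure, the memory stream needs to be nothing more exotic than an append-only log of timestamped records. A minimal Python sketch (the class and field names are mine, not the paper's):

```python
from dataclasses import dataclass

@dataclass
class Memory:
    description: str      # natural-language observation, e.g. "drank coffee at the café at 8am"
    created_at: float     # simulation timestamp (seconds)
    last_accessed: float  # updated on retrieval; feeds the recency score later
    importance: float     # 1-10, assigned by the LLM
    embedding: list       # vector used later for relevance lookup

class MemoryStream:
    """Append-only log of everything the agent experiences."""
    def __init__(self):
        self.records = []

    def add(self, memory: Memory):
        # Experiences simply accumulate in chronological order.
        self.records.append(memory)
```

Everything else in the architecture — retrieval, reflection, planning — reads from and writes back into this one stream.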

Retrieval is the mechanism for pulling up memories relevant to the current moment. It doesn't just grab the most recent memories. It combines three criteria:

  • Recency — more recent memories score higher, using an exponential decay function.
  • Importance — the LLM scores each memory from 1 to 10. "Brushed teeth" is a 1; "broke up with someone" is a 10.
  • Relevance — semantic similarity to the current context, calculated as cosine similarity between embedding vectors.

The three scores are normalized and summed to select the most appropriate memories. It mirrors the intuition of "what comes to mind right now, given where I am?"
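The scoring is easy to sketch. A toy Python version, with memories as plain dicts; the decay constant and the equal weighting of the three terms are illustrative choices, not necessarily the paper's exact values:

```python
import math

def retrieval_score(memory, query_embedding, now, decay=0.995):
    # Recency: exponential decay per hour since the memory was last accessed.
    hours = (now - memory["last_accessed"]) / 3600
    recency = decay ** hours

    # Importance: the LLM's 1-10 score, normalized to [0, 1].
    importance = memory["importance"] / 10

    # Relevance: cosine similarity between query and memory embeddings.
    dot = sum(a * b for a, b in zip(query_embedding, memory["embedding"]))
    norm = math.sqrt(sum(a * a for a in query_embedding)) * \
           math.sqrt(sum(b * b for b in memory["embedding"]))
    relevance = dot / norm if norm else 0.0

    # Equal-weight sum of the three criteria.
    return recency + importance + relevance

def retrieve(memories, query_embedding, now, k=3):
    """Return the k memories most worth surfacing right now."""
    return sorted(memories,
                  key=lambda m: retrieval_score(m, query_embedding, now),
                  reverse=True)[:k]
```

A recent, important, on-topic memory maxes out all three terms; an old chore with no connection to the current context scores near zero and stays buried.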

The most fascinating part is Reflection. Agents periodically look back over their own memories. When the sum of importance scores for recent experiences exceeds 150, a reflection is triggered — roughly two or three times a day.

Here's how it works. The agent looks at its most recent 100 memories and asks itself, "What are the 3 most important questions I can derive from these?" It then retrieves memories using those questions and asks, "What are 5 high-level insights I can infer from these memories?" The output is an abstract judgment like "Klaus is deeply dedicated to his research."

These reflections are stored back into the memory stream, becoming material for future reflections. It's a recursive structure: observation → reflection → higher-order reflection. Not just accumulating experience, but learning from it.
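The whole reflection loop fits in a few lines. In this sketch, `agent["llm"]` and `agent["retrieve"]` are stand-in callables for the language model and the retrieval mechanism, and the prompts and the importance score given to stored insights are my paraphrases, not the paper's exact ones:

```python
def maybe_reflect(agent, threshold=150):
    """Trigger reflection when recent experiences pile up past a threshold."""
    recent = agent["memories"][-100:]
    if sum(m["importance"] for m in recent) < threshold:
        return []  # not enough has happened since the last reflection

    # Step 1: derive the most salient questions from recent events.
    questions = agent["llm"](
        "What are the 3 most salient high-level questions "
        "we can answer about the subjects in these statements?")

    insights = []
    for question in questions:
        # Step 2: retrieve memories relevant to each question and
        # distill them into abstract insights.
        evidence = agent["retrieve"](question)
        insights += agent["llm"](
            f"What 5 high-level insights can you infer from: {evidence}?")

    # Step 3: insights go back into the memory stream as new memories,
    # so future reflections can build on them recursively.
    # (The importance score here is illustrative.)
    for text in insights:
        agent["memories"].append({"description": text, "importance": 8})
    return insights
```

The recursion lives in that last step: because insights are stored as ordinary memories, the next reflection can cite previous reflections as evidence.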

Finally, there is Planning. Each morning, agents sketch out the day's agenda in broad strokes, then break it down into hour-level slots, and further into 5–15 minute actions. Crucially, the plan is not fixed. If the agent has an unexpected conversation or an unplanned event occurs, the plan is rewritten from that point on. This is why agents who attended the Valentine's party were able to adjust their own schedules.
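That reactive loop can be shown in miniature. In this toy version, `replan` stands in for what would be an LLM call in the real system; here it just splices a reaction into the remainder of the day:

```python
def replan(agent, from_time, observation):
    # Stand-in for an LLM call: insert a reaction to the observation and
    # keep the untouched remainder of the old plan behind it.
    tail = [(t, a) for t, a in agent["plan"] if t > from_time]
    return [(from_time, f"react to: {observation}")] + tail

def run_day(agent, events):
    plan = agent["plan"]  # list of (hour, action) from the morning pass
    log, i = [], 0
    while i < len(plan):
        time, action = plan[i]
        if time in events:
            # An unexpected observation invalidates the rest of the plan:
            # everything from this point on is regenerated.
            plan[i:] = replan(agent, time, events.pop(time))
            time, action = plan[i]
        log.append((time, action))
        i += 1
    return log
```

The key property is that only the future is rewritten; the morning's completed actions stay in the log exactly as they happened, which is how a party invitation at noon can reshape an afternoon without retconning the morning.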

More Human Than Humans

The researchers recruited 100 evaluators to measure agent believability. The evaluation method is interesting: they interviewed the agents directly. They asked 25 questions — "Introduce yourself," "What are you doing tomorrow at 10am?", "Your breakfast is burning! What do you do?" — and compared responses across five different agent configurations.

The results were striking. The TrueSkill score for the full-architecture agent was 29.89. Remove only reflection and it dropped to 26.88. Human crowdworkers scored 22.95. An agent stripped of memory, planning, and reflection bottomed out at 21.21. The AI showed more convincingly human behavior than actual humans, by a wide margin.

Social relationships also formed rapidly. At the start, most agents didn't know each other. Two days later, nearly all of them had met. Numerically, social connectedness increased by a factor of 4.4. And exchanges like this took place:

Sam: "Hey Latoya, how's that photography project you mentioned last time going?"
Latoya: "Hey Sam! It's going well. Thanks for remembering!"

Remembering a past conversation. Asking how someone is doing. That is relational continuity, produced by the memory stream and retrieval system.

Failures That Make It More Interesting

Of course, the agents were not perfect. The failures were actually the more interesting part.

One neighbor happened to be named Adam Smith, and an agent, introducing him, called him "the author of The Wealth of Nations." The LLM's prior world knowledge bled into the virtual world's context. Agents walked into shops after closing time. Two agents entered a single-occupancy bathroom together.

As the researchers themselves noted, the agents were also excessively agreeable. When someone made a suggestion, they rarely refused, a side effect of instruction tuning. And the simulation cost "thousands of dollars in token credits", and that was on GPT-3.5-turbo.

These limitations aren't mere bugs. They define the boundary conditions of what kind of world model an AI needs to internalize in order to interact with reality. Physical rules, social norms, context separation — there is still a lot left to solve.

And Then: Britannia, 1998

While reading this paper, one world kept coming back to me. Ultima Online's Britannia.

In 1998, I poured an enormous amount of time into this game. Called the forefather of MMORPGs, it wasn't about leveling up to fight bosses. There were blacksmiths, tailors, and miners; players formed an economy out of their individual roles. They built houses, opened shops, founded guilds. There was the fear of PKs (player killers) and the vigilante groups that rose to fight them, scammers and the community boards that warned others about them. The rules were minimal; the society was built by the players themselves.

The emergent social dynamics that Smallville demonstrated — a society forming with no designer — were something humans had already been doing in Ultima Online. Back then, though, there were real people behind the monitors.

What if AI agents lived in that Britannia?

That question wouldn't leave me. So I started a project. The name is Anima. A project to build AI characters that live autonomously in the world of Ultima Online.

Smallville ran on a simple 2D sandbox the researchers built themselves. Ultima Online is different. It's a world with 28 years of history, with private servers still running and real players in them. Hundreds of skills and professions, a complex economy, unpredictable PvP — releasing AI into a living world, not a controlled lab.

Anima is still in its early stages. The agent can connect to a server, mine ore, smelt it, craft weapons, and sell them — a basic survival loop is implemented. It hasn't yet reached the kind of high-order cognition Smallville's reflection system has. But the direction is clear.

What I ultimately want to see is this: AI agents holding jobs in Britannia, trading with other players (human or AI), forming relationships, remembering, reflecting. And through that process, seeing what a society where humans and AI naturally coexist actually looks like.

The Questions We'll Have to Face

The Smallville study was an experiment where 25 AIs lived in a village for two days. Even that was enough for a spontaneous party, election campaigning, and relationship formation to emerge. What happens when this system runs longer, is placed in a more complex world, and lives alongside humans?

The Smallville paper poses a question: "If your neighbor were not a person but an AI that perfectly remembered and reflected on every conversation it ever had with you — could you become genuine friends with them?"

I'm going to look for that answer in Britannia. There's a long way to go, but the experiment itself is already fascinating. The world I built together with other people 28 years ago — this time, I'm going in with AI.