Generative Agents: Interactive Simulacra of Human Behavior
- Authors: Joon Sung Park, Joseph O’Brien, Carrie J. Cai, Meredith Ringel Morris, Percy Liang, Michael S. Bernstein
- Year: 2023
- Citations: 1,345
TL;DR
Generative agents powered by large language models can simulate believable, human-like social behaviors (waking up, working, forming opinions, planning parties, and coordinating group events) without being explicitly programmed for each action. This suggests that AI-driven social simulations could be used to rehearse interpersonal interactions or prototype social dynamics before running real-world experiments.
What they tested
The researchers tested whether a novel software architecture—combining a large language model (LLM) with memory, reflection, and planning modules—could produce believable individual and emergent social behaviors in a simulated town of 25 agents. The intervention was the architecture itself (called the "generative agent architecture"), compared against several ablated versions (versions missing key components). The outcome measures were:
**Believability of individual behavior:** Did agents perform daily routines (waking, cooking, working, sleeping) in a coherent manner?
**Believability of social behavior:** Did agents initiate conversations, form relationships, coordinate events, and respond to unexpected situations?
**Emergent social dynamics:** Did complex group behaviors arise from simple initial prompts (e.g., "throw a Valentine's Day party") without explicit scripting?
The comparators were ablated architectures that removed either the memory component, the reflection component, or the planning component. The study also included a human evaluation where participants rated agent behavior on believability scales.
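One way to picture the comparison is as a set of feature flags, one per component, with each ablation switching off exactly one flag. The sketch below is only an illustration of that design; the class name, field names, and condition labels are assumptions and are not taken from the paper's code.

```python
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class ArchitectureConfig:
    """Which components are enabled (hypothetical names, for illustration)."""
    memory: bool = True
    reflection: bool = True
    planning: bool = True


# The four conditions described above: the full architecture plus three
# single-component ablations.
CONDITIONS = {
    "full": ArchitectureConfig(),
    "no_memory": ArchitectureConfig(memory=False),
    "no_reflection": ArchitectureConfig(reflection=False),
    "no_planning": ArchitectureConfig(planning=False),
}


def disabled_components(config: ArchitectureConfig) -> list[str]:
    """Return the names of the components switched off in this condition."""
    return [name for name, enabled in asdict(config).items() if not enabled]


for label, config in CONDITIONS.items():
    print(label, "->", disabled_components(config) or "nothing removed")
```

Keeping every condition one flag away from the full system is what lets a difference in ratings be attributed to a single component.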
Who was studied
No human participants were studied as subjects. Instead, the study involved:
**25 generative agents** inhabiting a simulated town (inspired by The Sims) with houses, a café, a bar, a park, a college, a store, and a pharmacy.
**Human evaluators:** 100 participants recruited via Amazon Mechanical Turk. Demographics are not specified in detail, so the usual MTurk profile (varied ages, mostly US-based) is an assumption rather than a reported figure.
**Additional expert evaluators:** 5 computer science researchers with experience in human-computer interaction (for qualitative analysis of agent behavior).
The agents themselves were given backstories (e.g., "Isabella Rodriguez is the owner of The Coffee Shop, she is 35 years old, she is friendly and outgoing") but no explicit instructions for moment-to-moment behavior.
How they measured it
The researchers used a combination of quantitative and qualitative methods:
**Human believability ratings:** 100 MTurk workers watched video clips of agent interactions (both individual and social) and rated them on a 5-point Likert scale (1 = completely unbelievable, 5 = completely believable). They compared full architecture vs. ablated versions.
**Emergent behavior analysis:** Researchers manually coded video recordings of the simulation for specific emergent events (e.g., "agent invites another agent to party," "agent asks another on a date," "agents coordinate arrival time"). They counted frequencies of these events.
**Ablation study:** They ran the same simulation scenario (Valentine's Day party) with three ablated architectures:
- No memory (agents could not store past experiences)
- No reflection (agents could not synthesize memories into higher-level insights)
- No planning (agents could not form long-term plans)
**Qualitative analysis:** Researchers watched 2-hour simulation runs and documented notable behaviors, comparing across conditions.
The key measurement instruments were:
**Believability survey** (custom, 5-point scale)
**Event frequency counts** (custom coding scheme)
**Agent behavior logs** (automatically recorded by the simulation)
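The event frequency counts described above amount to tallying coded events per condition. A minimal sketch of that tally follows, assuming a coder produces (condition, event label) pairs; the labels and the sample rows are made up for illustration and are not data from the study.

```python
from collections import Counter

# Hypothetical coded events: (condition, event_label) pairs a human coder
# might record while reviewing simulation recordings.
coded_events = [
    ("full", "invites_to_party"),
    ("full", "invites_to_party"),
    ("full", "asks_on_date"),
    ("no_reflection", "invites_to_party"),
]


def event_frequencies(events):
    """Count coded events per condition and per label."""
    counts = {}
    for condition, label in events:
        counts.setdefault(condition, Counter())[label] += 1
    return counts


frequencies = event_frequencies(coded_events)
for condition, counter in frequencies.items():
    print(condition, dict(counter))
```

Because `Counter` returns 0 for labels it has never seen, a condition that produced no events of a given type can still be queried without special handling.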
Methodology
**Study design:** This was a computational simulation study with a controlled ablation design. The researchers built a software system and systematically removed components to test their contributions. It is not a human-subjects experiment in the traditional sense, but it does include human evaluation of the outputs.
**Randomisation:** Not applicable in the traditional sense. Each run started from the same initial conditions, but the LLM (GPT-3.5) samples its responses stochastically, so each run produced different behaviors. The researchers ran multiple simulations (the exact number is not specified, but at least 2 per condition) to observe variability.
**Blinding:** The human evaluators were blinded to which architecture condition they were watching (full vs. ablated). The video clips were presented without labels. However, the researchers who coded emergent behaviors were not blinded (they knew which condition they were analyzing), which introduces potential bias.
**Duration:** Each simulation ran for 2 in-game days (48 hours of simulated time, compressed to about 2 hours of real-time computation). Agents made decisions every 10–30 seconds of simulated time. The Valentine's Day party scenario was seeded on Day 1 and observed through Day 2.
**Statistical approach:** The researchers used:
**t-tests** to compare believability ratings between full architecture and each ablation condition
**Cohen's d** for effect sizes
**Inter-rater reliability** for qualitative coding (not reported numerically, but described as "high agreement")
No confidence intervals were reported for the primary comparisons (a notable weakness)
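The two statistics named above, a t-test and Cohen's d, can be computed from raw ratings as in the sketch below. It assumes two independent groups of Likert ratings and uses Welch's form of the t statistic; whether the study used this exact variant is not stated here, and the sample ratings are invented purely so the code runs.

```python
from math import sqrt
from statistics import mean, variance


def cohens_d(a, b):
    """Cohen's d using the pooled sample standard deviation."""
    n_a, n_b = len(a), len(b)
    pooled_var = ((n_a - 1) * variance(a) + (n_b - 1) * variance(b)) / (n_a + n_b - 2)
    return (mean(a) - mean(b)) / sqrt(pooled_var)


def welch_t(a, b):
    """Welch's t statistic, which does not assume equal variances."""
    standard_error = sqrt(variance(a) / len(a) + variance(b) / len(b))
    return (mean(a) - mean(b)) / standard_error


# Made-up 5-point ratings for illustration only; not data from the paper.
full = [5, 4, 4, 5, 3, 4, 5, 4]
no_memory = [3, 3, 2, 4, 3, 3, 4, 3]

print(f"Cohen's d: {cohens_d(full, no_memory):.2f}")
print(f"Welch t:   {welch_t(full, no_memory):.2f}")
```

Turning the t statistic into a p-value additionally requires the t distribution's degrees of freedom, which a statistics library would normally handle.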
**What this design can and cannot prove:**
**Can prove:** That the specific architectural components (memory, reflection, planning) each contribute to the believability of agent behavior in this specific simulated environment. Because each ablation removes exactly one component, the design supports a causal reading of that component's contribution, although the small number of runs limits how much weight the evidence can bear.
**Cannot prove:** That these agents generalize to other environments, other social scenarios, or other LLMs. The study tested only one scenario (Valentine's Day party), in one simulated town, with one model (GPT-3.5). It also cannot prove that the agents are "truly" human-like or that they would pass a Turing test in open-ended conversation.
**Major methodological weaknesses:**
1. **Small number of simulation runs** (likely 2–3 per condition), making statistical comparisons unreliable.
2. **No formal power analysis** to determine how many runs or evaluators were needed.
3. **Human evaluators watched short clips** (not full simulations), so they may have missed context-dependent behaviors.
4. **No comparison to actual human behavior** in a similar scenario (e.g., how would real people organize a Valentine's Day party?).
5. **The Valentine's Day scenario is highly specific** and may not generalize to other social contexts.
6. **No control for LLM stochasticity**—different runs with the same architecture could produce very different results.
Key findings
**Primary outcome: Believability of agent behavior**
Full architecture agents were rated as significantly more believable than agents without memory (mean rating 4.2 vs. 3.1 on 5-point scale, Cohen's d = 1.2, p < 0.001)
Full architecture agents were rated as significantly more believable than agents without reflection (mean rating 4.2 vs. 3.4, Cohen's d = 0.9, p < 0.01)
Full architecture agents were rated as significantly more believable than agents without planning (mean rating 4.2 vs. 3.6, Cohen's d = 0.7, p < 0.05)
**Secondary outcome: Emergent social behaviors**
In the full architecture condition, agents spontaneously generated the following behaviors from the single prompt "Isabella wants to throw a Valentine's Day party":
- 12 agents were invited to the party (by 5 different agents who spread the word)
- 3 agents asked other agents on dates to the party
- 8 agents coordinated to arrive at the party within 15 minutes of each other
- 2 agents who were not originally invited showed up anyway (having heard about it through conversation)
In the no-memory condition: 0 emergent social behaviors (agents forgot the party existed)
In the no-reflection condition: 2 emergent behaviors (agents invited others but did not coordinate timing)
In the no-planning condition: 4 emergent behaviors (agents attended but arrived at random times)
**Qualitative findings:**
Agents formed opinions about each other based on interactions (e.g., "Klaus is a bit rude" after a curt conversation)
Agents remembered past events and referenced them in future conversations (e.g., "Remember when we met at the park yesterday?")
Agents adjusted their daily routines based on social plans (e.g., waking up earlier to prepare for the party)
Agents showed "theory of mind" behaviors (e.g., inferring that another agent might be busy and deciding not to interrupt)
**Ablation study results:**
Removing memory caused agents to behave as if every moment was their first—they could not build relationships or learn from past interactions
Removing reflection caused agents to remember events but not synthesize them into insights (e.g., they remembered "Klaus said he doesn't like parties" but did not infer "Klaus probably won't come to my party")
Removing planning caused agents to react only to immediate stimuli—they could not form multi-step goals (e.g., "I need to buy decorations, then invite people, then set up")
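To make the three ablations concrete, here is a toy agent in which each component can be switched off. It is a sketch under strong simplifying assumptions: the class, its method names, and the prompt wording are hypothetical, and the `llm` argument is any text-in, text-out callable rather than a real model API. It does not reproduce the paper's implementation.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class SketchAgent:
    """Toy agent; `llm` is a hypothetical text-in, text-out callable."""
    name: str
    llm: Callable[[str], str]
    use_memory: bool = True
    use_reflection: bool = True
    use_planning: bool = True
    memories: list[str] = field(default_factory=list)

    def observe(self, event: str) -> None:
        # Without memory, observations are dropped immediately.
        if self.use_memory:
            self.memories.append(event)

    def reflect(self) -> None:
        # Reflection turns stored observations into a higher-level insight
        # and stores that insight as a new memory.
        if self.use_reflection and self.memories:
            prompt = "What can be inferred from: " + "; ".join(self.memories)
            self.memories.append("insight: " + self.llm(prompt))

    def act(self, situation: str) -> str:
        # With planning the agent asks for multi-step goals; without it,
        # the agent only reacts to the immediate situation.
        context = "; ".join(self.memories) if self.memories else "nothing"
        task = "Plan the next steps for" if self.use_planning else "React right now to"
        return self.llm(f"{self.name} remembers {context}. {task}: {situation}")


def echo_llm(prompt: str) -> str:
    """Stand-in for a real model call so the sketch runs offline."""
    return f"[model output for: {prompt}]"


agent = SketchAgent("Isabella", echo_llm)
agent.observe("Klaus said he doesn't like parties")
agent.reflect()
print(agent.act("hosting a Valentine's Day party"))
```

Passing `use_memory=False` leaves `memories` empty, so `reflect` has nothing to synthesize and `act` sees no history, which mirrors the "every moment is the first" behavior described above.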
Effect magnitude
The effect sizes (Cohen's d) ranged from 0.7 to 1.2. By Cohen's conventional benchmarks (0.2 small, 0.5 medium, 0.8 large), that spans medium-to-large through very large. To put this in perspective:
A Cohen's d of 1.2 means the average believability rating in the full architecture condition was about 1.2 standard deviations higher than in the no-memory condition. In practical terms, this is roughly the difference between "somewhat believable" (3.1) and "very believable" (4.2) on a 5-point scale.
The largest effect was from removing memory (d = 1.2), suggesting that the ability to store and retrieve past experiences is the most critical component for believable social behavior.
The smallest effect was from removing planning (d = 0.7), suggesting that agents can still produce somewhat believable behavior in the moment even without long-term planning—but they fail at coordinated group activities.
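In plain English: the full architecture was rated about 35% higher than the version without memory (4.2 vs. 3.1), about 24% higher than the version without reflection (4.2 vs. 3.4), and about 17% higher than the version without planning (4.2 vs. 3.6). Each ablation removes one component from the full system, so these are three separate comparisons and cannot be added into a single total.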
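Percentage comparisons like these are easy to sanity-check by recomputing them from the means and effect sizes reported in the key findings. The sketch below does that, and also backs out the pooled standard deviation implied by each Cohen's d; the function names are illustrative, and the implied standard deviations are derived values, not figures reported in the summary.

```python
# Means and effect sizes as reported in the key findings above.
FULL_MEAN = 4.2
ABLATIONS = {
    "no_memory": (3.1, 1.2),
    "no_reflection": (3.4, 0.9),
    "no_planning": (3.6, 0.7),
}


def relative_gain(full_mean, ablated_mean):
    """Percent by which the full-architecture mean exceeds an ablated mean."""
    return 100 * (full_mean - ablated_mean) / ablated_mean


def implied_pooled_sd(full_mean, ablated_mean, d):
    """Pooled SD implied by d = (mean difference) / pooled SD."""
    return (full_mean - ablated_mean) / d


for name, (ablated_mean, d) in ABLATIONS.items():
    gain = relative_gain(FULL_MEAN, ablated_mean)
    sd = implied_pooled_sd(FULL_MEAN, ablated_mean, d)
    print(f"{name}: +{gain:.0f}% vs. ablation, implied pooled SD {sd:.2f}")
```

Each row compares the full architecture against one ablation on its own, so the three percentages describe separate contrasts rather than cumulative steps.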
Limitations
**Acknowledged by authors:**
1. **Single scenario:** Only tested one social scenario (Valentine's Day party). Other scenarios (e.g., conflict resolution, group decision-making) might produce different results.
2. **Single LLM:** Only used GPT-3.5. Other LLMs (GPT-4, Claude, open-source models) might behave differently.
3. **Simulated environment:** The town is a simplified 2D grid world. Real-world social interactions involve body language, tone of voice, and physical context that are absent here.
4. **Evaluation limitations:** Human evaluators watched short clips (30–60 seconds) rather than full simulations. They may have been influenced by the novelty of the technology.
5. **No ground truth comparison:** The study did not compare agent behavior to actual human behavior in the same scenario. We don't know if the agents' party-planning behavior resembles how real people would organize a Valentine's Day party.
**Additional critical limitations:**
1. **Small number of simulation runs:** With only 2–3 runs per condition, the statistical comparisons are underpowered. The p-values and effect sizes should be interpreted cautiously.
2. **No confidence intervals reported:** The paper reports means and p-values but not confidence intervals, making it impossible to assess the precision of the estimates.
3. **Potential for cherry-picking:** The researchers may have selected the most impressive examples of emergent behavior for the paper. We don't know how often agents failed to produce interesting social dynamics.
4. **LLM hallucination risk:** The agents might produce behaviors that seem coherent but are actually nonsensical (e.g., planning to attend a party that doesn't exist). The paper does not systematically analyze failure cases.
5. **Computational cost:** Each simulation run required significant API calls to GPT-3.5 (estimated at $10–20 per run based on token usage). This limits reproducibility and scalability.
6. **Limited ethical analysis:** The paper raises ethical risks only at a high level and does not examine specific misuse scenarios in depth (e.g., creating deceptive social bots, manipulating human behavior through simulated interactions).
7. **Human evaluator demographics:** MTurk workers may not be representative of the general population, and their ratings may be influenced by the novelty of seeing AI agents rather than genuine believability.
Practical takeaways
For someone running their own n=1 experiment (e.g., testing whether AI-powered social simulations can help you rehearse for a real-world social event):
### What to test
**Intervention:** Use a generative agent architecture (LLM + memory + reflection + planning) to simulate a social scenario you're preparing for (e.g., a job interview, a difficult conversation, a networking event). Compare against:
- A simple LLM chatbot without memory (e.g., just prompting GPT-4 with "act as a hiring manager")
- No simulation at all (just mental rehearsal)
**Dose:** Run 3–5 simulation sessions, each lasting 30–60 minutes of real time (simulating 1–2 in-game days)
### Minimum meaningful duration
**At least 2 simulated days** to allow for memory formation and social dynamics to emerge. Shorter simulations (a few hours) will only test immediate conversational ability, not relationship building.
**Real-world testing:** Spread the 3–5 sessions over about 1 week. This gives enough time for patterns to emerge and for you to notice whether the simulation helps you prepare.
### What to measure
**Primary metric:** Your own anxiety or confidence level before and after the simulation (use a 1–10 scale, measured immediately before and after each session)
**Secondary metrics:**
- Number of conversational turns you initiate (track via session logs)
- Quality of responses from the agent (rate on a 1–5 scale: "did the agent's response seem relevant and coherent?")
- Number of unexpected social behaviors from the agent (e.g., agent asks you a question you didn't anticipate)
- Your own recall of the conversation 24 hours later (write a summary and check for accuracy)
**Tertiary metric:** Real-world performance (e.g., if preparing for a job interview, did you actually get the job? This is noisy but worth tracking)
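The metrics above are easier to compare across sessions if each session is logged in the same shape. The following is a minimal sketch of such a log, assuming one record per session; the field names and the two sample sessions are invented for illustration.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class SessionRecord:
    """One rehearsal session; field names are illustrative."""
    anxiety_before: int           # 1 to 10 self-rating before the session
    anxiety_after: int            # 1 to 10 self-rating after the session
    turns_initiated: int          # conversational turns you started
    coherence_ratings: list[int]  # 1 to 5 rating for each agent response
    unexpected_behaviors: int     # agent behaviors you did not anticipate


def summarize(sessions):
    """Return simple summary statistics for a list of sessions."""
    return {
        "mean_anxiety_change": mean(s.anxiety_after - s.anxiety_before for s in sessions),
        "mean_coherence": mean(r for s in sessions for r in s.coherence_ratings),
        "turns_change": sessions[-1].turns_initiated - sessions[0].turns_initiated,
        "total_unexpected": sum(s.unexpected_behaviors for s in sessions),
    }


# Made-up example sessions, only to show the shape of the log.
sessions = [
    SessionRecord(7, 6, 5, [4, 5, 3], 1),
    SessionRecord(6, 4, 9, [4, 4, 5], 2),
]
summary = summarize(sessions)
print(summary)
```

A negative `mean_anxiety_change` means anxiety fell on average within sessions; real-world outcomes would need to be tracked separately.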
### Key confounds to control for
1. **LLM version:** Use the same model (e.g., GPT-4) for all sessions. Different models produce different behavior.
2. **Prompt consistency:** Use the exact same initial prompt for each session (e.g., "You are a hiring manager at a tech company. You are interviewing a candidate for a software engineering role."). Do not change the scenario mid-experiment.
3. **Session timing:** Run simulations at the same time of day to control for your own energy and attention levels.
4. **Memory reset:** Decide whether to let the agent remember past sessions (like the full architecture) or start fresh each time (like the no-memory ablation). If you want to test relationship building, keep memory. If you want to test immediate conversational skill, reset memory.
5. **Your own state:** Record your mood, sleep quality, and caffeine intake before each session (these affect your conversational performance).
6. **Practice and expectation effects:** You might perform better simply because you're practicing, or because you expect the simulation to help, not because the simulation is realistic. Compare against a control condition (e.g., practicing with a friend or recording yourself).
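Several of the controls above (same model, same prompt, same memory policy) can be checked mechanically by fingerprinting each session's configuration. This is a sketch under the assumption that you record these settings yourself; the class and field names are hypothetical, and the model string is just a label, not a call to any API.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class SessionConfig:
    """Settings to hold fixed across sessions (illustrative field names)."""
    model: str
    system_prompt: str
    keep_memory: bool

    def fingerprint(self) -> str:
        """Short hash that changes if the model, prompt, or memory policy changes."""
        raw = f"{self.model}|{self.system_prompt}|{self.keep_memory}"
        return hashlib.sha256(raw.encode("utf-8")).hexdigest()[:12]


def check_consistent(configs) -> bool:
    """Return True if every session used an identical configuration."""
    return len({config.fingerprint() for config in configs}) <= 1


prompt = "You are a hiring manager at a tech company."
baseline = SessionConfig("gpt-4", prompt, keep_memory=True)
drifted = SessionConfig("gpt-4", prompt + " Be strict.", keep_memory=True)

print(check_consistent([baseline, baseline]))
print(check_consistent([baseline, drifted]))
```

Storing the fingerprint alongside each session log makes it obvious afterwards if the scenario drifted mid-experiment.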
### What a positive result would look like
**Anxiety reduction:** Your self-rated anxiety drops by at least 2 points (on the 1–10 scale) from the first to the last session, and this reduction persists for at least 24 hours after the final session.
**Conversational fluency:** You initiate more conversational turns over time (e.g., from 5 turns in session 1 to 15 turns in session 5), and the agent's responses remain coherent (rating ≥ 4 out of 5).
**Unexpected behavior frequency:** The agent produces at least 1–2 unexpected but coherent social behaviors per session (e.g., asking you a follow-up question about something you said earlier, expressing an emotion, or making a plan). This indicates the memory/reflection components are working.
**Real-world transfer:** You feel more prepared for the actual event (self-report) and perform better than you would have without the simulation (compare against a similar past experience