
A survey on large language model based autonomous agents

Authors: Lei Wang, Chen Ma, Xueyang Feng, Zeyu Zhang, Hao Yang, Jingsen Zhang, Zhiyuan Chen, Jiakai Tang, Xu Chen, Yankai Lin, Wayne Xin Zhao, Zhewei Wei, Ji-Rong Wen
Journal: Frontiers of Computer Science
Year: 2024
DOI: 10.1007/s11704-024-40231-1
Citations: 1,039

TL;DR

This systematic review of roughly 200 papers on LLM-based autonomous agents found that agents built around a unified architecture (profiling, memory, planning, and action modules) can perform complex tasks across social science, natural science, and engineering. Their reliability, safety, and generalisability remain unproven, and no single benchmark or standardised evaluation framework has yet been established.

What they tested

This is a systematic review, not an experiment. The authors did not test any intervention. Instead, they:

**Analysed** the architectural designs of LLM-based autonomous agents across ~200 papers published between January 2021 and August 2023

**Categorised** agent designs into four modules: profiling, memory, planning, and action

**Compared** three methods for creating agent profiles: handcrafting, LLM-generation, and dataset alignment

**Reviewed** applications in three domains: social science (e.g., simulating human behaviour), natural science (e.g., scientific discovery), and engineering (e.g., software development)

**Evaluated** assessment strategies, distinguishing subjective (human judgement) from objective (automated metrics) approaches

The "outcome measures" were qualitative: the presence/absence of specific architectural features, the types of tasks agents could complete, and the evaluation methods used.

Who was studied

No human participants were studied. The "subjects" were:

**~200 published papers** on LLM-based autonomous agents from January 2021 to August 2023

**Specific agent systems** reviewed in detail: Generative Agents, MetaGPT, ChatDev, RecAgent, PTLLM, and others

**LLMs used as backbones**: primarily GPT-3, GPT-4, ChatGPT, and open-source models like LLaMA

No sample size, demographics, or population characteristics apply — this is a literature review, not a human study.

How they measured it

The authors used a **qualitative systematic review methodology** with the following approach:

**Literature search**: Papers were collected from arXiv, ACL, NeurIPS, ICML, ICLR, and other major AI venues

**Inclusion criteria**: Papers proposing or evaluating LLM-based autonomous agents published between January 2021 and August 2023

**Taxonomy development**: The authors created a unified framework (profiling → memory → planning → action) and classified each paper according to which modules it used

**Application categorisation**: Papers were grouped by domain (social science, natural science, engineering)

**Evaluation strategy analysis**: Each paper's evaluation method was classified as subjective (human raters, surveys) or objective (automated metrics, task completion rates)

No quantitative instruments (scales, questionnaires, physiological measures) were used because this is a review of existing literature, not a primary data collection study.

Methodology

**Study design:** Systematic review with qualitative synthesis. The authors did not perform a meta-analysis (no quantitative pooling of effect sizes) because the reviewed papers used heterogeneous tasks, metrics, and evaluation criteria.

**Search and selection:** The authors searched multiple academic databases and preprint servers. They do not report a PRISMA flow diagram, search strings, or explicit inclusion/exclusion criteria beyond the date range (January 2021 to August 2023) and topic relevance. This is a methodological weakness: without a transparent search strategy, the review cannot be reproduced or audited for missed papers.

**Data extraction:** For each paper, the authors extracted:

Agent architecture (which modules were used)

Profile generation method (handcrafted, LLM-generated, or dataset-aligned)

Application domain

Evaluation approach (subjective vs. objective)

Key findings
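The extraction fields listed above map onto a small record type. The following is a minimal sketch, not the authors' tooling: the `PaperRecord` class, its field names, the `tally` helper, and the three example rows are all hypothetical placeholders introduced for illustration.

```python
from collections import Counter
from dataclasses import dataclass
from enum import Enum
from typing import FrozenSet, Optional


class ProfileMethod(Enum):
    HANDCRAFTED = "handcrafted"
    LLM_GENERATED = "llm_generated"
    DATASET_ALIGNED = "dataset_aligned"


class Evaluation(Enum):
    SUBJECTIVE = "subjective"
    OBJECTIVE = "objective"


@dataclass(frozen=True)
class PaperRecord:
    title: str
    modules: FrozenSet[str]                  # subset of the four framework modules
    profile_method: Optional[ProfileMethod]  # None if no profiling module is used
    domain: str
    evaluation: Evaluation
    key_findings: str = ""


def tally(records, attribute):
    """Count how many records share each value of the given attribute."""
    return Counter(getattr(r, attribute) for r in records)


# Placeholder rows for illustration only; NOT data from the review.
records = [
    PaperRecord("placeholder paper A",
                frozenset({"profiling", "memory", "planning", "action"}),
                ProfileMethod.HANDCRAFTED, "social science", Evaluation.SUBJECTIVE),
    PaperRecord("placeholder paper B",
                frozenset({"planning", "action"}),
                None, "engineering", Evaluation.OBJECTIVE),
    PaperRecord("placeholder paper C",
                frozenset({"memory", "planning", "action"}),
                ProfileMethod.LLM_GENERATED, "engineering", Evaluation.OBJECTIVE),
]

print(tally(records, "evaluation"))
print(tally(records, "domain"))
```

Keeping one record per paper makes percentages such as "share of papers using objective evaluation" a direct count rather than an estimate read off a figure.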

**Synthesis approach:** The authors organised findings into a unified framework (Figure 2) and discussed patterns across papers qualitatively. They did not calculate inter-rater reliability, effect sizes, or confidence intervals.

**What this design can prove:**

It can identify common architectural patterns across the field

It can map the landscape of current research and highlight gaps

It can propose a standardised taxonomy for future work

**What this design cannot prove:**

It cannot establish which agent architecture is "best" — no head-to-head comparisons were performed

It cannot quantify effect sizes or statistical significance of any intervention

It cannot control for publication bias (papers with positive results are more likely to be published)

It cannot assess the quality or rigour of individual studies systematically (no risk-of-bias assessment is reported)

**Major methodological weaknesses:**

No pre-registered protocol

No systematic quality assessment of included studies (e.g., no use of ROBINS-I or similar tools)

No quantitative synthesis or meta-analysis

The search strategy is not fully reproducible

The review is descriptive rather than evaluative — it catalogues what exists rather than testing hypotheses

Key findings

**Architecture patterns:**

The unified framework (profiling → memory → planning → action) encompasses "most" previous work, though the authors do not report what percentage of papers fit this framework

Three profile generation methods were identified: handcrafting (most common), LLM-generation, and dataset alignment

Memory modules were classified into short-term (within-session) and long-term (cross-session) variants

Planning modules ranged from simple chain-of-thought prompting to hierarchical task decomposition
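To make the four-module framework concrete, here is a minimal sketch of how profiling, memory, planning, and action might fit together. It is an illustration under stated assumptions, not code from the paper or from any reviewed system: the class names are hypothetical, `plan` is a trivial stand-in for an LLM planner (it just splits the task on semicolons), and `act` is any callable you supply.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Profile:
    # Profiling module: who the agent is and what it is for.
    role: str
    goal: str


@dataclass
class Memory:
    short_term: List[str] = field(default_factory=list)  # within-session
    long_term: List[str] = field(default_factory=list)   # cross-session

    def remember(self, item: str) -> None:
        self.short_term.append(item)

    def end_session(self) -> None:
        # Promote the session buffer to long-term storage, then clear it.
        self.long_term.extend(self.short_term)
        self.short_term.clear()


def plan(task: str) -> List[str]:
    # Planning module stand-in (assumption): a real agent would ask an LLM
    # to decompose the task; here "a; b; c" is simply split into steps.
    return [step.strip() for step in task.split(";") if step.strip()]


@dataclass
class Agent:
    profile: Profile
    memory: Memory
    act: Callable[[str], str]  # action module: executes one step

    def run(self, task: str) -> List[str]:
        results = []
        for step in plan(task):
            outcome = self.act(step)
            self.memory.remember(f"{step} -> {outcome}")
            results.append(outcome)
        return results


# Toy action that upper-cases the step, so the sketch runs without any API.
agent = Agent(Profile("assistant", "demo"), Memory(), act=str.upper)
print(agent.run("fetch page; parse table"))  # prints ['FETCH PAGE', 'PARSE TABLE']
```

Calling `agent.memory.end_session()` afterwards moves both remembered steps into `long_term`, mirroring the short-term versus long-term split described above.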

**Application domains:**

**Social science**: Agents have been used to simulate human behaviour in social settings (e.g., Generative Agents simulating a small town of 25 agents), study opinion dynamics, and model economic decisions

**Natural science**: Agents have been applied to scientific discovery (e.g., ChemCrow for chemistry, BioGPT for biology), though the authors note these are "early-stage"

**Engineering**: The most mature application area, with agents used for software development (MetaGPT, ChatDev), code generation, and tool use

**Evaluation strategies:**

**Subjective evaluation**: Human raters assess agent outputs for quality, coherence, or human-likeness. Used in ~40% of reviewed papers (estimated from figures, not explicitly stated)

**Objective evaluation**: Automated metrics (e.g., task completion rate, BLEU score, accuracy on benchmarks). Used in ~60% of reviewed papers (also estimated from figures, not explicitly stated)

**No standardised benchmark exists**: Different papers use different tasks, making cross-study comparison impossible

**Capability acquisition strategies:**

**Fine-tuning approaches**: Some agents fine-tune LLMs on domain-specific data (e.g., code for programming agents)

**Prompt-based approaches**: Most agents use in-context learning (prompt engineering) without modifying model weights

**Tool use**: Many agents are equipped with external tools (e.g., web search, calculators, code interpreters) to extend capabilities

**Challenges identified:**

**Reliability**: LLM-based agents can hallucinate, produce inconsistent outputs, or fail on simple tasks

**Safety**: Agents acting autonomously could cause harm (e.g., generating malicious code, giving dangerous advice)

**Generalisation**: Agents trained/tested in one domain often fail in others

**Evaluation**: No consensus on how to measure agent performance

Effect magnitude

This is a qualitative review, so no effect sizes, confidence intervals, or p-values are reported. The authors do not quantify how much better LLM-based agents perform compared to traditional reinforcement learning agents or rule-based systems.

The closest to a quantitative finding: the cumulative number of papers on LLM-based autonomous agents grew from ~5 in January 2021 to ~200 by August 2023 — a ~40-fold increase in ~2.5 years. This is a bibliometric observation, not an experimental effect.
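The arithmetic behind the growth figure can be checked in a few lines. This sketch uses the approximate counts quoted above and assumes, purely for illustration, that growth was exponential over the 31 months from January 2021 to August 2023; the doubling time is a back-of-envelope figure derived here, not one reported by the authors.

```python
import math

papers_start = 5   # approximate cumulative count, January 2021
papers_end = 200   # approximate cumulative count, August 2023
months = 31        # January 2021 to August 2023

fold_increase = papers_end / papers_start

# Under an exponential-growth assumption, N(t) = N0 * 2**(t / T),
# so the doubling time T = t * ln(2) / ln(N(t) / N0).
doubling_time_months = months * math.log(2) / math.log(fold_increase)

print(f"{fold_increase:.0f}-fold increase")
print(f"implied doubling time: {doubling_time_months:.1f} months")
```

With these inputs the fold increase is 40 and the implied doubling time is just under six months.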

Limitations

**What the authors acknowledge:**

The field is "rapidly developing" and the review may not capture the most recent work

The proposed unified framework may not encompass all possible agent architectures

Evaluation strategies are "not yet mature"

The review is descriptive rather than prescriptive

**What a critical reader would note:**

**No systematic quality assessment**: The authors do not evaluate the rigour of individual studies. A paper evaluated with 5 human raters or test cases and one evaluated with 500 are treated equally

**Publication bias**: The field is dominated by positive results (agents that work). Failed architectures or negative results are rarely published

**LLM dependence**: Most reviewed agents use proprietary LLMs (GPT-3/4). Results may not generalise to open-source models or future model versions

**No replication analysis**: The authors do not report whether any findings have been independently replicated

**Industry funding**: Many reviewed papers come from tech companies (OpenAI, Google, Meta) with commercial interests in LLMs. The review does not discuss conflicts of interest

**Temporal bias**: The review covers only 2021–2023. Given the field's rapid pace, findings may already be outdated

**No negative results**: The review focuses on what agents can do, not what they fail at. Failures are mentioned only briefly in the challenges section

**Lack of quantitative synthesis**: Without meta-analysis, it's impossible to know which approaches are statistically superior

Practical takeaways

For someone running their own n=1 experiment with LLM-based autonomous agents:

### What to test (specific intervention and dose)

**Intervention**: Build an LLM-based autonomous agent using the unified framework (profiling + memory + planning + action modules)

**Dose**: Start with a single agent performing one well-defined task (e.g., "write a Python script to scrape website X and save results to CSV"). Do not attempt multi-agent collaboration initially

**Comparison**: Compare against (a) doing the task manually, (b) using a simple LLM prompt without agent architecture, or (c) using a traditional rule-based system

### Minimum meaningful duration

**Per trial**: 1–3 hours for a single task completion

**Total experiment**: At least 10–20 trials across different tasks to assess generalisability

**Long-term**: If testing memory/learning, run 5–10 sessions over 1–2 weeks, with the agent retaining information across sessions

### What to measure (specific metrics)

**Task completion rate**: Did the agent complete the task? (binary: yes/no)

**Time to completion**: Minutes from start to finish

**Error rate**: Number of mistakes (e.g., syntax errors in code, incorrect outputs)

**Number of human interventions**: How many times did you need to correct or redirect the agent?

**Output quality**: Rate on a 1–5 scale (subjective, but use a rubric: accuracy, completeness, readability)

**Hallucination count**: Number of false statements or fabricated information

**Cost**: API calls made, tokens used, total cost in USD
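One way to keep these metrics consistent across trials is to log each trial as a record and summarise afterwards. The sketch below is a suggestion only: the `Trial` class, its field names, and the two example rows are hypothetical, and the numbers are made up to show the shape of the output.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class Trial:
    completed: bool       # task completion (yes/no)
    minutes: float        # time to completion
    errors: int           # mistakes in the output
    interventions: int    # times a human corrected or redirected the agent
    quality: int          # 1-5 rubric score
    hallucinations: int   # false or fabricated statements
    cost_usd: float       # total API cost for the trial


def summarise(trials):
    """Aggregate per-trial logs into the metrics listed above."""
    return {
        "completion_rate": sum(t.completed for t in trials) / len(trials),
        "mean_minutes": mean(t.minutes for t in trials),
        "mean_errors": mean(t.errors for t in trials),
        "mean_interventions": mean(t.interventions for t in trials),
        "mean_quality": mean(t.quality for t in trials),
        "total_hallucinations": sum(t.hallucinations for t in trials),
        "total_cost_usd": sum(t.cost_usd for t in trials),
    }


# Made-up example rows, for illustration only.
log = [
    Trial(True, 60.0, 0, 1, 4, 0, 1.0),
    Trial(False, 120.0, 2, 3, 2, 1, 2.0),
]
for name, value in summarise(log).items():
    print(f"{name}: {value}")
```

Recording every field on every trial, including failed ones, avoids the temptation to log only the runs that went well.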

### Key confounds to control for

**LLM version**: Pin the same dated model snapshot throughout (e.g., `gpt-4-turbo-2024-04-09`) rather than a floating alias. Model updates can change behaviour

**Prompt engineering**: Small changes in prompts can cause large changes in output. Document your exact prompts

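**Temperature setting**: Keep temperature constant across conditions (e.g., 0–0.2 for tasks that need consistent output, around 0.7 for creative tasks). Even at low temperature, outputs are not guaranteed to be identical between runs

**Task difficulty**: Match task difficulty across conditions, or vary it systematically. Don't compare one condition on easy tasks with another on hard tasks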

**Order effects**: Randomise the order of tasks if comparing multiple conditions

**Learning effects**: The agent may "learn" from previous tasks if memory is enabled. Control for this by resetting the agent between conditions

**Human bias**: If you're evaluating outputs subjectively, use blinded evaluation (rate outputs without knowing which condition produced them)
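Several of these controls (fixed settings, randomised order, blinded rating) can be set up before the first trial. The following is a minimal sketch using only the standard library; `build_schedule`, `blind_key`, the task names, and the condition labels are hypothetical, and the model string is simply the example quoted above.

```python
import random
from itertools import product

# Held constant for every run; the model string is the example from the text.
FIXED_CONFIG = {"model": "gpt-4-turbo-2024-04-09", "temperature": 0.2}


def build_schedule(tasks, conditions, seed=0):
    """Return every task/condition pair in a reproducible random order."""
    rng = random.Random(seed)
    runs = [{"task": t, "condition": c} for t, c in product(tasks, conditions)]
    rng.shuffle(runs)  # guards against order effects
    # Blind IDs are shuffled separately so an ID reveals neither the
    # run order nor the condition.
    codes = list(range(1, len(runs) + 1))
    rng.shuffle(codes)
    for run, code in zip(runs, codes):
        run["blind_id"] = f"output-{code:03d}"
    return runs


def blind_key(schedule):
    """Map blind ID to condition; keep this sealed until rating is finished."""
    return {run["blind_id"]: run["condition"] for run in schedule}


schedule = build_schedule(
    tasks=["task-1", "task-2", "task-3"],
    conditions=["agent", "plain-prompt"],
    seed=42,
)
for run in schedule:
    print(run["blind_id"], run["task"], run["condition"])
```

Save each output under its `blind_id`, rate the files, and only then open `blind_key(schedule)` to match ratings back to conditions. Resetting the agent's memory between runs still has to be done by whatever agent framework you use.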

### What a positive result would look like

**Task completion rate**: ≥80% of tasks completed without human intervention (compared to ≤50% with a simple LLM prompt)

**Time savings**: Agent completes tasks in ≤25% of the time it takes you manually

**Error reduction**: Agent makes ≤1 error per task (compared to ≥3 errors when you do it manually)

**Cost efficiency**: Total API cost is less than the value of your time saved (e.g., $2 in API calls saves you 30 minutes of work)

**Consistency**: Agent produces similar-quality outputs across 10+ trials (standard deviation in quality ratings ≤0.5 on a 1–5 scale)

**Generalisation**: Agent succeeds on tasks it wasn't explicitly designed for (e.g., a "code-writing agent" can also debug existing code)

**Warning**: A single successful trial does not mean the agent is reliable. Run at least 10 trials before drawing conclusions. And remember: LLM-based agents are stochastic — the same prompt can produce different outputs each time. Track this variability.
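The numeric thresholds above can be turned into a simple pass/fail check. This is a sketch under assumptions: the `assess` function and its parameter names are hypothetical, the example data are made up, and `statistics.stdev` computes the sample standard deviation (use `pstdev` if you prefer the population version).

```python
from statistics import stdev


def assess(completed, quality, api_cost_usd, minutes_saved, hourly_value_usd):
    """Compare trial results against the thresholds listed above."""
    completion_rate = sum(completed) / len(completed)
    value_saved_usd = minutes_saved / 60 * hourly_value_usd
    return {
        "enough_trials": len(completed) >= 10,
        "completion_ok": completion_rate >= 0.80,
        "consistency_ok": stdev(quality) <= 0.5,
        "cost_ok": api_cost_usd < value_saved_usd,
    }


# Made-up example: 10 trials, 9 completed, quality rated on a 1-5 scale.
report = assess(
    completed=[True] * 9 + [False],
    quality=[4, 4, 4, 5, 4, 4, 4, 4, 4, 4],
    api_cost_usd=2.0,
    minutes_saved=30,
    hourly_value_usd=40.0,  # assumed value of your time
)
for check, passed in report.items():
    print(f"{check}: {'pass' if passed else 'fail'}")
```

In this example every check passes: the completion rate is 0.9, the sample standard deviation of the ratings is about 0.32, and $2 of API cost is set against $20 of time saved. Time savings, error reduction, and generalisation are not covered here and need their own comparisons against a manual baseline.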
