References
[1] S. Russell and P. Norvig, Artificial intelligence: A modern approach, 4th ed. Pearson, 2021.
[2] M. Wooldridge and N. R. Jennings, “Intelligent agents: Theory and practice,” The Knowledge Engineering Review, vol. 10, no. 2, pp. 115–152, 1995.
[3] A. S. Rao and M. P. Georgeff, “BDI agents: From theory to practice,” in Proceedings of the first international conference on multi-agent systems (ICMAS), 1995.
[4] R. S. Sutton and A. G. Barto, Reinforcement learning: An introduction, 2nd ed. MIT Press, 2018.
[5] R. A. Brooks, “Intelligence without representation,” Artificial Intelligence, vol. 47, no. 1–3, pp. 139–159, 1991.
[6] A. Newell and H. A. Simon, “Computer science as empirical inquiry: Symbols and search,” Communications of the ACM, vol. 19, no. 3, pp. 113–126, 1976.
[7] J. Wei et al., “Chain-of-thought prompting elicits reasoning in large language models,” in Advances in neural information processing systems (NeurIPS), 2022.
[8] S. Yao et al., “ReAct: Synergizing reasoning and acting in language models,” in International conference on learning representations (ICLR), 2023.
[9] S. Yao et al., “Tree of thoughts: Deliberate problem solving with large language models,” in Advances in neural information processing systems (NeurIPS), 2023.
[10] N. Shinn, F. Cassano, E. Berman, A. Gopinath, K. Narasimhan, and S. Yao, “Reflexion: Language agents with verbal reinforcement learning,” in Advances in neural information processing systems (NeurIPS), 2023.
[11] T. Kojima, S. S. Gu, M. Reid, Y. Matsuo, and Y. Iwasawa, “Large language models are zero-shot reasoners,” in Advances in neural information processing systems (NeurIPS), 2022.
[12] X. Wang et al., “Self-consistency improves chain of thought reasoning in language models,” in International conference on learning representations (ICLR), 2023.
[13] L. Wang et al., “Plan-and-solve prompting: Improving zero-shot chain-of-thought reasoning by large language models,” in Proceedings of the 61st annual meeting of the association for computational linguistics (ACL), 2023.
[14] S. Kim et al., “An LLM compiler for parallel function calling,” in International conference on machine learning (ICML), 2024.
[15] H. Lightman et al., “Let’s verify step by step,” arXiv preprint arXiv:2305.20050, 2023.
[16] M. Besta et al., “Demystifying chains, trees, and graphs of thoughts,” arXiv preprint arXiv:2401.14295, 2024.
[17] I. Arcuschin, J. Janiak, R. Krzyzanowski, S. Rajamanoharan, N. Nanda, and A. Conmy, “Chain-of-thought reasoning in the wild is not always faithful,” in International conference on machine learning (ICML), 2025.
[18] Y. Chen et al., “Reasoning models don’t always say what they think,” arXiv preprint arXiv:2505.05410, 2025.
[19] F. Haji, M. Bethany, M. Tabar, J. Chiang, A. Rios, and P. Najafirad, “Improving LLM reasoning with multi-agent tree-of-thought validator agent,” arXiv preprint arXiv:2409.11527, 2024.
[20] T. B. Brown et al., “Language models are few-shot learners,” in Advances in neural information processing systems (NeurIPS), 2020.
[21] L. Ouyang et al., “Training language models to follow instructions with human feedback,” in Advances in neural information processing systems (NeurIPS), 2022.
[22] J. Wei et al., “Emergent abilities of large language models,” Transactions on Machine Learning Research (TMLR), 2022.
[23] Z. Ji et al., “Survey of hallucination in natural language generation,” ACM Computing Surveys, vol. 55, no. 12, pp. 1–38, 2023.
[24] N. F. Liu et al., “Lost in the middle: How language models use long contexts,” Transactions of the Association for Computational Linguistics (TACL), vol. 12, 2024.
[25] L. Berglund et al., “The reversal curse: Language models trained on ‘a is b’ fail to learn ‘b is a’,” arXiv preprint arXiv:2309.12288, 2024.
[26] I. Mirzadeh, K. Alizadeh, H. Shahrokhi, O. Tuzel, S. Bengio, and M. Farajtabar, “GSM-Symbolic: Understanding the limitations of mathematical reasoning in large language models,” arXiv preprint arXiv:2410.05229, 2024.
[27] T. Schick et al., “Toolformer: Language models can teach themselves to use tools,” in Advances in neural information processing systems (NeurIPS), 2023.
[28] M. Li et al., “API-Bank: A comprehensive benchmark for tool-augmented LLMs,” in Proceedings of the 2023 conference on empirical methods in natural language processing (EMNLP), 2023.
[29] S. G. Patil, T. Zhang, X. Wang, and J. E. Gonzalez, “Gorilla: Large language model connected with massive APIs,” arXiv preprint arXiv:2305.15334, 2023.
[30] Y. Qin et al., “ToolLLM: Facilitating large language models to master 16000+ real-world APIs,” arXiv preprint arXiv:2307.16789, 2023.
[31] F. Yan et al., “Berkeley function-calling leaderboard (BFCL).” https://gorilla.cs.berkeley.edu/blogs/8_berkeley_function_calling_leaderboard.html, 2024.
[32] Y. Ruan et al., “Identifying the risks of LM agents with an LM-emulated sandbox,” in International conference on learning representations (ICLR), 2024.
[33] P. Lewis et al., “Retrieval-augmented generation for knowledge-intensive NLP tasks,” in Advances in neural information processing systems (NeurIPS), 2020.
[34] G. Mialon et al., “Augmented language models: A survey,” arXiv preprint arXiv:2302.07842, 2023. Available: https://arxiv.org/abs/2302.07842
[35] A. Asai, Z. Wu, Y. Wang, A. Sil, and H. Hajishirzi, “Self-RAG: Learning to retrieve, generate, and critique through self-reflection,” in International conference on learning representations (ICLR), 2024. Available: https://arxiv.org/abs/2310.11511
[36] S. Borgeaud et al., “Improving language models by retrieving from trillions of tokens,” in International conference on machine learning (ICML), 2022. Available: https://arxiv.org/abs/2112.04426
[37] J. S. Park, J. C. O’Brien, C. J. Cai, M. R. Morris, P. Liang, and M. S. Bernstein, “Generative agents: Interactive simulacra of human behavior,” in ACM symposium on user interface software and technology (UIST), 2023.
[38] G. Wang et al., “Voyager: An open-ended embodied agent with large language models,” arXiv preprint arXiv:2305.16291, 2023.
[39] Z. Xi et al., “The rise and potential of large language model based agents: A survey,” arXiv preprint arXiv:2309.07864, 2023.
[40] Y. Du, S. Li, A. Torralba, J. B. Tenenbaum, and I. Mordatch, “Improving factuality and reasoning in language models through multiagent debate,” arXiv preprint arXiv:2305.14325, 2023.
[41] M. Cemri et al., “Why do multi-agent LLM systems fail?” arXiv preprint arXiv:2503.13657, 2025.
[42] K.-T. Tran, D. Dao, M.-D. Nguyen, Q.-V. Pham, B. O’Sullivan, and H. D. Nguyen, “Multi-agent collaboration mechanisms: A survey of LLMs,” arXiv preprint arXiv:2501.06322, 2025.
[43] Q. Wu et al., “AutoGen: Enabling next-gen LLM applications via multi-agent conversation,” arXiv preprint arXiv:2308.08155, 2023.
[44] S. Hong et al., “MetaGPT: Meta programming for a multi-agent collaborative framework,” arXiv preprint arXiv:2308.00352, 2023.
[45] G. Li, H. A. A. K. Hammoud, H. Itani, D. Khizbullin, and B. Ghanem, “CAMEL: Communicative agents for ‘mind’ exploration of large language model society,” in Advances in neural information processing systems (NeurIPS), 2023.
[46] A. Smit, P. Duckworth, N. Grinsztajn, T. D. Barrett, and A. Pretorius, “Should we be going MAD? A look at multi-agent debate strategies for LLMs,” arXiv preprint arXiv:2311.17371, 2024.
[47] X. Liu, H. Yu, H. Zhang, et al., “AgentBench: Evaluating LLMs as agents,” in International conference on learning representations (ICLR), 2024.
[48] C. E. Jimenez et al., “SWE-bench: Can language models resolve real-world GitHub issues?” in International conference on learning representations (ICLR), 2024.
[49] T. Xie et al., “OSWorld: Benchmarking multimodal agents for open-ended tasks in real computer environments,” in Advances in neural information processing systems (NeurIPS), datasets and benchmarks track, 2024.
[50] S. Yao, N. Shinn, P. Razavi, and K. Narasimhan, “τ-bench: A benchmark for tool-agent-user interaction in real-world domains,” arXiv preprint arXiv:2406.12045, 2024. Available: https://arxiv.org/abs/2406.12045
[51] L. Zheng et al., “Judging LLM-as-a-judge with MT-Bench and chatbot arena,” in Advances in neural information processing systems (NeurIPS), datasets and benchmarks track, 2023. Available: https://arxiv.org/abs/2306.05685
[52] S. Es, J. James, L. Espinosa-Anke, and S. Schockaert, “RAGAS: Automated evaluation of retrieval augmented generation,” arXiv preprint arXiv:2309.15217, 2023. Available: https://arxiv.org/abs/2309.15217
[53] OpenAI, “Introducing SWE-bench verified.” https://openai.com/index/introducing-swe-bench-verified/, Aug. 2024.
[54] Anthropic, M. Grace, J. Hadfield, R. Olivares, and J. De Jonghe, “Demystifying evals for AI agents.” https://www.anthropic.com/engineering/demystifying-evals-for-ai-agents, Jan. 2026.
[55] Anthropic, E. Schluntz, and B. Zhang, “Building effective agents.” https://www.anthropic.com/engineering/building-effective-agents, Dec. 2024.
[56] Anthropic et al., “How we built our multi-agent research system.” https://www.anthropic.com/engineering/multi-agent-research-system, Jun. 2025.
[57] W. Yan and Cognition, “Don’t build multi-agents.” https://cognition.ai/blog/dont-build-multi-agents, Jun. 2025.
[58] Microsoft, “Agentic application patterns.” https://learn.microsoft.com/en-us/azure/durable-task/sdks/durable-agents-patterns, 2026.
[59] OpenAI, “New tools for building agents.” https://openai.com/index/new-tools-for-building-agents/, Mar. 2025.
[60] OpenAI, “OpenAI agents SDK documentation.” https://openai.github.io/openai-agents-python/, 2025.
[61] Anthropic, “Model context protocol.” https://modelcontextprotocol.io, 2024.
[62] Model Context Protocol, “Model context protocol specification (2025-06-18).” https://modelcontextprotocol.io/specification/2025-06-18, 2025.
[63] X. Hou, Y. Zhao, S. Wang, and H. Wang, “Model context protocol (MCP): Landscape, security threats, and future research directions,” arXiv preprint arXiv:2503.23278, 2025.
[64] Z. Wang et al., “MCPTox: A benchmark for tool poisoning attack on real-world MCP servers,” arXiv preprint arXiv:2508.14925, 2025.
[65] A. RoyChowdhury, M. Luo, P. Sahu, S. Banerjee, and M. Tiwari, “ConfusedPilot: Confused deputy risks in RAG-based LLMs,” arXiv preprint arXiv:2408.04870, 2024.
[66] LangChain, “LangGraph documentation.” https://langchain-ai.github.io/langgraph/, 2024.
[67] LangChain, “LangGraph persistence.” https://docs.langchain.com/oss/python/langgraph/persistence, 2025.
[68] LangChain, “Human-in-the-loop.” https://docs.langchain.com/oss/python/langchain/human-in-the-loop, 2025.
[69] OpenTelemetry Authors, “Semantic conventions for generative AI systems.” https://opentelemetry.io/blog/2026/genai-observability/, 2026.
[70] OWASP, “OWASP top 10 for large language model applications.” https://owasp.org/www-project-top-10-for-large-language-model-applications/, 2025.
[71] B. H. Sigelman et al., “Dapper, a large-scale distributed systems tracing infrastructure,” Google, Inc., 2010. Available: https://research.google/pubs/dapper-a-large-scale-distributed-systems-tracing-infrastructure/
[72] S. Kanzhelev, M. McLean, A. Reitbauer, B. Drutu, N. Molnar, and Y. Shkuro, “Trace context.” https://www.w3.org/TR/trace-context/, Nov. 2021.
[73] Arize AI, “OpenInference semantic conventions.” https://github.com/Arize-ai/openinference/blob/main/spec/semantic_conventions.md, 2024.
[74] C. Packer et al., “MemGPT: Towards LLMs as operating systems,” arXiv preprint arXiv:2310.08560, 2023. Available: https://arxiv.org/abs/2310.08560
[75] W. Zhong, L. Guo, Q. Gao, H. Ye, and Y. Wang, “MemoryBank: Enhancing large language models with long-term memory,” arXiv preprint arXiv:2305.10250, 2023. Available: https://arxiv.org/abs/2305.10250
[76] P. Chhikara, D. Khant, S. Aryan, T. Singh, and D. Yadav, “Mem0: Building production-ready AI agents with scalable long-term memory,” arXiv preprint arXiv:2504.19413, 2025. Available: https://arxiv.org/abs/2504.19413
[77] D. Wu, H. Wang, W. Yu, Y. Zhang, K.-W. Chang, and D. Yu, “LongMemEval: Benchmarking chat assistants on long-term interactive memory,” in International conference on learning representations (ICLR), 2025. Available: https://arxiv.org/abs/2410.10813
[78] A. Maharana, D.-H. Lee, S. Tulyakov, M. Bansal, F. Barbieri, and Y. Fang, “Evaluating very long-term conversational memory of LLM agents,” arXiv preprint arXiv:2402.17753, 2024. Available: https://arxiv.org/abs/2402.17753
[79] H. Tan, Z. Zhang, C. Ma, X. Chen, Q. Dai, and Z. Dong, “MemBench: Towards more comprehensive evaluation on the memory of LLM-based agents,” in Findings of the association for computational linguistics: ACL 2025, 2025, pp. 19336–19352. Available: https://aclanthology.org/2025.findings-acl.989/
[80] W. Xu, Z. Liang, K. Mei, H. Gao, J. Tan, and Y. Zhang, “A-MEM: Agentic memory for LLM agents,” in Advances in neural information processing systems (NeurIPS), 2025. Available: https://arxiv.org/abs/2502.12110
[81] Y. Hu et al., “Memory in the age of AI agents,” arXiv preprint arXiv:2512.13564, 2025. Available: https://arxiv.org/abs/2512.13564