References

[1]
S. Russell and P. Norvig, Artificial intelligence: A modern approach, 4th ed. Pearson, 2021.
[2]
M. Wooldridge and N. R. Jennings, “Intelligent agents: Theory and practice,” The Knowledge Engineering Review, vol. 10, no. 2, pp. 115–152, 1995.
[3]
A. S. Rao and M. P. Georgeff, “BDI agents: From theory to practice,” in Proceedings of the first international conference on multi-agent systems (ICMAS), 1995.
[4]
R. S. Sutton and A. G. Barto, Reinforcement learning: An introduction, 2nd ed. MIT Press, 2018.
[5]
R. A. Brooks, “Intelligence without representation,” Artificial Intelligence, vol. 47, no. 1–3, pp. 139–159, 1991.
[6]
A. Newell and H. A. Simon, “Computer science as empirical inquiry: Symbols and search,” Communications of the ACM, vol. 19, no. 3, pp. 113–126, 1976.
[7]
J. Wei et al., “Chain-of-thought prompting elicits reasoning in large language models,” in Advances in neural information processing systems (NeurIPS), 2022.
[8]
S. Yao et al., “ReAct: Synergizing reasoning and acting in language models,” in International conference on learning representations (ICLR), 2023.
[9]
S. Yao et al., “Tree of thoughts: Deliberate problem solving with large language models,” in Advances in neural information processing systems (NeurIPS), 2023.
[10]
N. Shinn, F. Cassano, E. Berman, A. Gopinath, K. Narasimhan, and S. Yao, “Reflexion: Language agents with verbal reinforcement learning,” in Advances in neural information processing systems (NeurIPS), 2023.
[11]
T. Kojima, S. S. Gu, M. Reid, Y. Matsuo, and Y. Iwasawa, “Large language models are zero-shot reasoners,” in Advances in neural information processing systems (NeurIPS), 2022.
[12]
X. Wang et al., “Self-consistency improves chain of thought reasoning in language models,” in International conference on learning representations (ICLR), 2023.
[13]
L. Wang et al., “Plan-and-solve prompting: Improving zero-shot chain-of-thought reasoning by large language models,” in Proceedings of the 61st annual meeting of the association for computational linguistics (ACL), 2023.
[14]
S. Kim et al., “An LLM compiler for parallel function calling,” in International conference on machine learning (ICML), 2024.
[15]
H. Lightman et al., “Let’s verify step by step,” arXiv preprint arXiv:2305.20050, 2023.
[16]
M. Besta et al., “Demystifying chains, trees, and graphs of thoughts,” arXiv preprint arXiv:2401.14295, 2024.
[17]
I. Arcuschin, J. Janiak, R. Krzyzanowski, S. Rajamanoharan, N. Nanda, and A. Conmy, “Chain-of-thought reasoning in the wild is not always faithful,” in International conference on machine learning (ICML), 2025.
[18]
Y. Chen et al., “Reasoning models don’t always say what they think,” arXiv preprint arXiv:2505.05410, 2025.
[19]
F. Haji, M. Bethany, M. Tabar, J. Chiang, A. Rios, and P. Najafirad, “Improving LLM reasoning with multi-agent tree-of-thought validator agent,” arXiv preprint arXiv:2409.11527, 2024.
[20]
T. B. Brown et al., “Language models are few-shot learners,” in Advances in neural information processing systems (NeurIPS), 2020.
[21]
L. Ouyang et al., “Training language models to follow instructions with human feedback,” in Advances in neural information processing systems (NeurIPS), 2022.
[22]
J. Wei et al., “Emergent abilities of large language models,” Transactions on Machine Learning Research (TMLR), 2022.
[23]
Z. Ji et al., “Survey of hallucination in natural language generation,” ACM Computing Surveys, vol. 55, no. 12, pp. 1–38, 2023.
[24]
N. F. Liu et al., “Lost in the middle: How language models use long contexts,” Transactions of the Association for Computational Linguistics (TACL), vol. 12, 2024.
[25]
L. Berglund et al., “The reversal curse: Language models trained on ‘A is B’ fail to learn ‘B is A’,” arXiv preprint arXiv:2309.12288, 2024.
[26]
I. Mirzadeh, K. Alizadeh, H. Shahrokhi, O. Tuzel, S. Bengio, and M. Farajtabar, “GSM-Symbolic: Understanding the limitations of mathematical reasoning in large language models,” arXiv preprint arXiv:2410.05229, 2024.
[27]
T. Schick et al., “Toolformer: Language models can teach themselves to use tools,” in Advances in neural information processing systems (NeurIPS), 2023.
[28]
M. Li et al., “API-Bank: A comprehensive benchmark for tool-augmented LLMs,” in Proceedings of the 2023 conference on empirical methods in natural language processing (EMNLP), 2023.
[29]
S. G. Patil, T. Zhang, X. Wang, and J. E. Gonzalez, “Gorilla: Large language model connected with massive APIs,” arXiv preprint arXiv:2305.15334, 2023.
[30]
Y. Qin et al., “ToolLLM: Facilitating large language models to master 16000+ real-world APIs,” arXiv preprint arXiv:2307.16789, 2023.
[31]
F. Yan et al., “Berkeley function-calling leaderboard (BFCL).” https://gorilla.cs.berkeley.edu/blogs/8_berkeley_function_calling_leaderboard.html, 2024.
[32]
Y. Ruan et al., “Identifying the risks of LM agents with an LM-emulated sandbox,” in International conference on learning representations (ICLR), 2024.
[33]
P. Lewis et al., “Retrieval-augmented generation for knowledge-intensive NLP tasks,” in Advances in neural information processing systems (NeurIPS), 2020.
[34]
G. Mialon et al., “Augmented language models: A survey,” arXiv preprint arXiv:2302.07842, 2023, Available: https://arxiv.org/abs/2302.07842
[35]
A. Asai, Z. Wu, Y. Wang, A. Sil, and H. Hajishirzi, “Self-RAG: Learning to retrieve, generate, and critique through self-reflection,” in International conference on learning representations (ICLR), 2024. Available: https://arxiv.org/abs/2310.11511
[36]
S. Borgeaud et al., “Improving language models by retrieving from trillions of tokens,” in International conference on machine learning (ICML), 2022. Available: https://arxiv.org/abs/2112.04426
[37]
J. S. Park, J. C. O’Brien, C. J. Cai, M. R. Morris, P. Liang, and M. S. Bernstein, “Generative agents: Interactive simulacra of human behavior,” in ACM symposium on user interface software and technology (UIST), 2023.
[38]
G. Wang et al., “Voyager: An open-ended embodied agent with large language models,” arXiv preprint arXiv:2305.16291, 2023.
[39]
Z. Xi et al., “The rise and potential of large language model based agents: A survey,” arXiv preprint arXiv:2309.07864, 2023.
[40]
Y. Du, S. Li, A. Torralba, J. B. Tenenbaum, and I. Mordatch, “Improving factuality and reasoning in language models through multiagent debate,” arXiv preprint arXiv:2305.14325, 2023.
[41]
M. Cemri et al., “Why do multi-agent LLM systems fail?” arXiv preprint arXiv:2503.13657, 2025.
[42]
K.-T. Tran, D. Dao, M.-D. Nguyen, Q.-V. Pham, B. O’Sullivan, and H. D. Nguyen, “Multi-agent collaboration mechanisms: A survey of LLMs,” arXiv preprint arXiv:2501.06322, 2025.
[43]
Q. Wu et al., “AutoGen: Enabling next-gen LLM applications via multi-agent conversation,” arXiv preprint arXiv:2308.08155, 2023.
[44]
S. Hong et al., “MetaGPT: Meta programming for a multi-agent collaborative framework,” arXiv preprint arXiv:2308.00352, 2023.
[45]
G. Li, H. A. A. K. Hammoud, H. Itani, D. Khizbullin, and B. Ghanem, “CAMEL: Communicative agents for ‘mind’ exploration of large language model society,” in Advances in neural information processing systems (NeurIPS), 2023.
[46]
A. Smit, P. Duckworth, N. Grinsztajn, T. D. Barrett, and A. Pretorius, “Should we be going MAD? A look at multi-agent debate strategies for LLMs,” arXiv preprint arXiv:2311.17371, 2024.
[47]
X. Liu et al., “AgentBench: Evaluating LLMs as agents,” in International conference on learning representations (ICLR), 2024.
[48]
C. E. Jimenez et al., “SWE-bench: Can language models resolve real-world GitHub issues?” in International conference on learning representations (ICLR), 2024.
[49]
T. Xie et al., “OSWorld: Benchmarking multimodal agents for open-ended tasks in real computer environments,” in Advances in neural information processing systems (NeurIPS), datasets and benchmarks track, 2024.
[50]
S. Yao, N. Shinn, P. Razavi, and K. Narasimhan, “τ-bench: A benchmark for tool-agent-user interaction in real-world domains,” arXiv preprint arXiv:2406.12045, 2024, Available: https://arxiv.org/abs/2406.12045
[51]
L. Zheng et al., “Judging LLM-as-a-judge with MT-Bench and chatbot arena,” in Advances in neural information processing systems (NeurIPS), datasets and benchmarks track, 2023. Available: https://arxiv.org/abs/2306.05685
[52]
S. Es, J. James, L. Espinosa-Anke, and S. Schockaert, “RAGAS: Automated evaluation of retrieval augmented generation,” arXiv preprint arXiv:2309.15217, 2023, Available: https://arxiv.org/abs/2309.15217
[53]
OpenAI, “Introducing SWE-bench verified.” https://openai.com/index/introducing-swe-bench-verified/, Aug. 2024.
[54]
Anthropic, M. Grace, J. Hadfield, R. Olivares, and J. De Jonghe, “Demystifying evals for AI agents.” https://www.anthropic.com/engineering/demystifying-evals-for-ai-agents, Jan. 2026.
[55]
Anthropic, E. Schluntz, and B. Zhang, “Building effective agents.” https://www.anthropic.com/engineering/building-effective-agents, Dec. 2024.
[56]
Anthropic et al., “How we built our multi-agent research system.” https://www.anthropic.com/engineering/multi-agent-research-system, Jun. 2025.
[57]
W. Yan and Cognition, “Don’t build multi-agents.” https://cognition.ai/blog/dont-build-multi-agents, Jun. 2025.
[58]
Microsoft, “Agentic application patterns.” https://learn.microsoft.com/en-us/azure/durable-task/sdks/durable-agents-patterns, 2026.
[59]
OpenAI, “New tools for building agents.” https://openai.com/index/new-tools-for-building-agents/, Mar. 2025.
[60]
OpenAI, “OpenAI agents SDK documentation.” https://openai.github.io/openai-agents-python/, 2025.
[61]
Anthropic, “Model context protocol.” https://modelcontextprotocol.io, 2024.
[62]
Model Context Protocol, “Model context protocol specification (2025-06-18).” https://modelcontextprotocol.io/specification/2025-06-18, 2025.
[63]
X. Hou, Y. Zhao, S. Wang, and H. Wang, “Model context protocol (MCP): Landscape, security threats, and future research directions,” arXiv preprint arXiv:2503.23278, 2025.
[64]
Z. Wang et al., “MCPTox: A benchmark for tool poisoning attack on real-world MCP servers,” arXiv preprint arXiv:2508.14925, 2025.
[65]
A. RoyChowdhury, M. Luo, P. Sahu, S. Banerjee, and M. Tiwari, “ConfusedPilot: Confused deputy risks in RAG-based LLMs,” arXiv preprint arXiv:2408.04870, 2024.
[66]
LangChain, “LangGraph documentation.” https://langchain-ai.github.io/langgraph/, 2024.
[67]
LangChain, “LangGraph persistence.” https://docs.langchain.com/oss/python/langgraph/persistence, 2025.
[68]
[69]
OpenTelemetry Authors, “Semantic conventions for generative AI systems.” https://opentelemetry.io/blog/2026/genai-observability/, 2026.
[70]
OWASP, “OWASP top 10 for large language model applications.” https://owasp.org/www-project-top-10-for-large-language-model-applications/, 2025.
[71]
B. H. Sigelman et al., “Dapper, a large-scale distributed systems tracing infrastructure,” Google, Inc., 2010. Available: https://research.google/pubs/dapper-a-large-scale-distributed-systems-tracing-infrastructure/
[72]
S. Kanzhelev, M. McLean, A. Reitbauer, B. Drutu, N. Molnar, and Y. Shkuro, “Trace context.” https://www.w3.org/TR/trace-context/, Nov. 2021.
[73]
Arize AI, “OpenInference semantic conventions.” https://github.com/Arize-ai/openinference/blob/main/spec/semantic_conventions.md, 2024.
[74]
C. Packer et al., “MemGPT: Towards LLMs as operating systems,” arXiv preprint arXiv:2310.08560, 2023, Available: https://arxiv.org/abs/2310.08560
[75]
W. Zhong, L. Guo, Q. Gao, H. Ye, and Y. Wang, “MemoryBank: Enhancing large language models with long-term memory,” arXiv preprint arXiv:2305.10250, 2023, Available: https://arxiv.org/abs/2305.10250
[76]
P. Chhikara, D. Khant, S. Aryan, T. Singh, and D. Yadav, “Mem0: Building production-ready AI agents with scalable long-term memory,” arXiv preprint arXiv:2504.19413, 2025, Available: https://arxiv.org/abs/2504.19413
[77]
D. Wu, H. Wang, W. Yu, Y. Zhang, K.-W. Chang, and D. Yu, “LongMemEval: Benchmarking chat assistants on long-term interactive memory,” in International conference on learning representations (ICLR), 2025. Available: https://arxiv.org/abs/2410.10813
[78]
A. Maharana, D.-H. Lee, S. Tulyakov, M. Bansal, F. Barbieri, and Y. Fang, “Evaluating very long-term conversational memory of LLM agents,” arXiv preprint arXiv:2402.17753, 2024, Available: https://arxiv.org/abs/2402.17753
[79]
H. Tan, Z. Zhang, C. Ma, X. Chen, Q. Dai, and Z. Dong, “MemBench: Towards more comprehensive evaluation on the memory of LLM-based agents,” in Findings of the association for computational linguistics: ACL 2025, 2025, pp. 19336–19352. Available: https://aclanthology.org/2025.findings-acl.989/
[80]
W. Xu, Z. Liang, K. Mei, H. Gao, J. Tan, and Y. Zhang, “A-MEM: Agentic memory for LLM agents,” in Advances in neural information processing systems (NeurIPS), 2025. Available: https://arxiv.org/abs/2502.12110
[81]
Y. Hu et al., “Memory in the age of AI agents,” arXiv preprint arXiv:2512.13564, 2025, Available: https://arxiv.org/abs/2512.13564