References

[1]

S. Russell and P. Norvig, Artificial intelligence: A modern approach, 4th ed. Pearson, 2021.

[2]

M. Wooldridge and N. R. Jennings, “Intelligent agents: Theory and practice,” The Knowledge Engineering Review, vol. 10, no. 2, pp. 115–152, 1995.

[3]

A. S. Rao and M. P. Georgeff, “BDI agents: From theory to practice,” in Proceedings of the first international conference on multi-agent systems (ICMAS), 1995.

[4]

R. S. Sutton and A. G. Barto, Reinforcement learning: An introduction, 2nd ed. MIT Press, 2018.

[5]

R. A. Brooks, “Intelligence without representation,” Artificial Intelligence, vol. 47, no. 1–3, pp. 139–159, 1991.

[6]

A. Newell and H. A. Simon, “Computer science as empirical inquiry: Symbols and search,” Communications of the ACM, vol. 19, no. 3, pp. 113–126, 1976.

[7]

J. Wei et al., “Chain-of-thought prompting elicits reasoning in large language models,” in Advances in neural information processing systems (NeurIPS), 2022.

[8]

S. Yao et al., “ReAct: Synergizing reasoning and acting in language models,” in International conference on learning representations (ICLR), 2023.

[9]

S. Yao et al., “Tree of thoughts: Deliberate problem solving with large language models,” in Advances in neural information processing systems (NeurIPS), 2023.

[10]

N. Shinn, F. Cassano, E. Berman, A. Gopinath, K. Narasimhan, and S. Yao, “Reflexion: Language agents with verbal reinforcement learning,” in Advances in neural information processing systems (NeurIPS), 2023.

[11]

T. Kojima, S. S. Gu, M. Reid, Y. Matsuo, and Y. Iwasawa, “Large language models are zero-shot reasoners,” in Advances in neural information processing systems (NeurIPS), 2022.

[12]

X. Wang et al., “Self-consistency improves chain of thought reasoning in language models,” in International conference on learning representations (ICLR), 2023.

[13]

L. Wang et al., “Plan-and-solve prompting: Improving zero-shot chain-of-thought reasoning by large language models,” in Proceedings of the 61st annual meeting of the association for computational linguistics (ACL), 2023.

[14]

S. Kim et al., “An LLM compiler for parallel function calling,” in International conference on machine learning (ICML), 2024.

[15]

H. Lightman et al., “Let’s verify step by step,” arXiv preprint arXiv:2305.20050, 2023.

[16]

M. Besta et al., “Demystifying chains, trees, and graphs of thoughts,” arXiv preprint arXiv:2401.14295, 2024.

[17]

I. Arcuschin, J. Janiak, R. Krzyzanowski, S. Rajamanoharan, N. Nanda, and A. Conmy, “Chain-of-thought reasoning in the wild is not always faithful,” in International conference on machine learning (ICML), 2025.

[18]

Y. Chen et al., “Reasoning models don’t always say what they think,” arXiv preprint arXiv:2505.05410, 2025.

[19]

F. Haji, M. Bethany, M. Tabar, J. Chiang, A. Rios, and P. Najafirad, “Improving LLM reasoning with multi-agent tree-of-thought validator agent,” arXiv preprint arXiv:2409.11527, 2024.

[20]

T. B. Brown et al., “Language models are few-shot learners,” in Advances in neural information processing systems (NeurIPS), 2020.

[21]

L. Ouyang et al., “Training language models to follow instructions with human feedback,” in Advances in neural information processing systems (NeurIPS), 2022.

[22]

J. Wei et al., “Emergent abilities of large language models,” Transactions on Machine Learning Research (TMLR), 2022.

[23]

Z. Ji et al., “Survey of hallucination in natural language generation,” ACM Computing Surveys, vol. 55, no. 12, pp. 1–38, 2023.

[24]

N. F. Liu et al., “Lost in the middle: How language models use long contexts,” Transactions of the Association for Computational Linguistics (TACL), vol. 12, 2024.

[25]

L. Berglund et al., “The reversal curse: Language models trained on ‘A is B’ fail to learn ‘B is A’,” arXiv preprint arXiv:2309.12288, 2024.

[26]

I. Mirzadeh, K. Alizadeh, H. Shahrokhi, O. Tuzel, S. Bengio, and M. Farajtabar, “GSM-Symbolic: Understanding the limitations of mathematical reasoning in large language models,” arXiv preprint arXiv:2410.05229, 2024.

[27]

T. Schick et al., “Toolformer: Language models can teach themselves to use tools,” in Advances in neural information processing systems (NeurIPS), 2023.

[28]

M. Li et al., “API-Bank: A comprehensive benchmark for tool-augmented LLMs,” in Proceedings of the 2023 conference on empirical methods in natural language processing (EMNLP), 2023.

[29]

S. G. Patil, T. Zhang, X. Wang, and J. E. Gonzalez, “Gorilla: Large language model connected with massive APIs,” arXiv preprint arXiv:2305.15334, 2023.

[30]

Y. Qin et al., “ToolLLM: Facilitating large language models to master 16000+ real-world APIs,” arXiv preprint arXiv:2307.16789, 2023.

[31]

F. Yan et al., “Berkeley function-calling leaderboard (BFCL).” https://gorilla.cs.berkeley.edu/blogs/8_berkeley_function_calling_leaderboard.html, 2024.

[32]

Y. Ruan et al., “Identifying the risks of LM agents with an LM-emulated sandbox,” in International conference on learning representations (ICLR), 2024.

[33]

P. Lewis et al., “Retrieval-augmented generation for knowledge-intensive NLP tasks,” in Advances in neural information processing systems (NeurIPS), 2020.

[34]

G. Mialon et al., “Augmented language models: A survey,” arXiv preprint arXiv:2302.07842, 2023. Available: https://arxiv.org/abs/2302.07842

[35]

A. Asai, Z. Wu, Y. Wang, A. Sil, and H. Hajishirzi, “Self-RAG: Learning to retrieve, generate, and critique through self-reflection,” in International conference on learning representations (ICLR), 2024. Available: https://arxiv.org/abs/2310.11511

[36]

S. Borgeaud et al., “Improving language models by retrieving from trillions of tokens,” in International conference on machine learning (ICML), 2022. Available: https://arxiv.org/abs/2112.04426

[37]

J. S. Park, J. C. O’Brien, C. J. Cai, M. R. Morris, P. Liang, and M. S. Bernstein, “Generative agents: Interactive simulacra of human behavior,” in ACM symposium on user interface software and technology (UIST), 2023.

[38]

G. Wang et al., “Voyager: An open-ended embodied agent with large language models,” arXiv preprint arXiv:2305.16291, 2023.

[39]

Z. Xi et al., “The rise and potential of large language model based agents: A survey,” arXiv preprint arXiv:2309.07864, 2023.

[40]

Y. Du, S. Li, A. Torralba, J. B. Tenenbaum, and I. Mordatch, “Improving factuality and reasoning in language models through multiagent debate,” arXiv preprint arXiv:2305.14325, 2023.

[41]

M. Cemri et al., “Why do multi-agent LLM systems fail?” arXiv preprint arXiv:2503.13657, 2025.

[42]

K.-T. Tran, D. Dao, M.-D. Nguyen, Q.-V. Pham, B. O’Sullivan, and H. D. Nguyen, “Multi-agent collaboration mechanisms: A survey of LLMs,” arXiv preprint arXiv:2501.06322, 2025.

[43]

Q. Wu et al., “AutoGen: Enabling next-gen LLM applications via multi-agent conversation,” arXiv preprint arXiv:2308.08155, 2023.

[44]

S. Hong et al., “MetaGPT: Meta programming for a multi-agent collaborative framework,” arXiv preprint arXiv:2308.00352, 2023.

[45]

G. Li, H. A. A. K. Hammoud, H. Itani, D. Khizbullin, and B. Ghanem, “CAMEL: Communicative agents for ‘mind’ exploration of large language model society,” in Advances in neural information processing systems (NeurIPS), 2023.

[46]

A. Smit, P. Duckworth, N. Grinsztajn, T. D. Barrett, and A. Pretorius, “Should we be going MAD? A look at multi-agent debate strategies for LLMs,” arXiv preprint arXiv:2311.17371, 2024.

[47]

X. Liu et al., “AgentBench: Evaluating LLMs as agents,” in International conference on learning representations (ICLR), 2024.

[48]

C. E. Jimenez et al., “SWE-bench: Can language models resolve real-world GitHub issues?” in International conference on learning representations (ICLR), 2024.

[49]

T. Xie et al., “OSWorld: Benchmarking multimodal agents for open-ended tasks in real computer environments,” in Advances in neural information processing systems (NeurIPS), datasets and benchmarks track, 2024.

[50]

S. Yao, N. Shinn, P. Razavi, and K. Narasimhan, “τ-bench: A benchmark for tool-agent-user interaction in real-world domains,” arXiv preprint arXiv:2406.12045, 2024. Available: https://arxiv.org/abs/2406.12045

[51]

L. Zheng et al., “Judging LLM-as-a-judge with MT-Bench and chatbot arena,” in Advances in neural information processing systems (NeurIPS), datasets and benchmarks track, 2023. Available: https://arxiv.org/abs/2306.05685

[52]

S. Es, J. James, L. Espinosa-Anke, and S. Schockaert, “RAGAS: Automated evaluation of retrieval augmented generation,” arXiv preprint arXiv:2309.15217, 2023. Available: https://arxiv.org/abs/2309.15217

[53]

OpenAI, “Introducing SWE-bench verified.” https://openai.com/index/introducing-swe-bench-verified/, Aug. 2024.

[54]

Anthropic, M. Grace, J. Hadfield, R. Olivares, and J. De Jonghe, “Demystifying evals for AI agents.” https://www.anthropic.com/engineering/demystifying-evals-for-ai-agents, Jan. 2026.

[55]

Anthropic, E. Schluntz, and B. Zhang, “Building effective agents.” https://www.anthropic.com/engineering/building-effective-agents, Dec. 2024.

[56]

Anthropic et al., “How we built our multi-agent research system.” https://www.anthropic.com/engineering/multi-agent-research-system, Jun. 2025.

[57]

W. Yan and Cognition, “Don’t build multi-agents.” https://cognition.ai/blog/dont-build-multi-agents, Jun. 2025.

[58]

Microsoft, “Agentic application patterns.” https://learn.microsoft.com/en-us/azure/durable-task/sdks/durable-agents-patterns, 2026.

[59]

OpenAI, “New tools for building agents.” https://openai.com/index/new-tools-for-building-agents/, Mar. 2025.

[60]

OpenAI, “OpenAI agents SDK documentation.” https://openai.github.io/openai-agents-python/, 2025.

[61]

Anthropic, “Model context protocol.” https://modelcontextprotocol.io, 2024.

[62]

Model Context Protocol, “Model context protocol specification (2025-06-18).” https://modelcontextprotocol.io/specification/2025-06-18, 2025.

[63]

X. Hou, Y. Zhao, S. Wang, and H. Wang, “Model context protocol (MCP): Landscape, security threats, and future research directions,” arXiv preprint arXiv:2503.23278, 2025.

[64]

Z. Wang et al., “MCPTox: A benchmark for tool poisoning attack on real-world MCP servers,” arXiv preprint arXiv:2508.14925, 2025.

[65]

A. RoyChowdhury, M. Luo, P. Sahu, S. Banerjee, and M. Tiwari, “ConfusedPilot: Confused deputy risks in RAG-based LLMs,” arXiv preprint arXiv:2408.04870, 2024.

[66]

LangChain, “LangGraph documentation.” https://langchain-ai.github.io/langgraph/, 2024.

[67]

LangChain, “LangGraph persistence.” https://docs.langchain.com/oss/python/langgraph/persistence, 2025.

[68]

LangChain, “Human-in-the-loop.” https://docs.langchain.com/oss/python/langchain/human-in-the-loop, 2025.

[69]

OpenTelemetry Authors, “Semantic conventions for generative AI systems.” https://opentelemetry.io/blog/2026/genai-observability/, 2026.

[70]

OWASP, “OWASP top 10 for large language model applications.” https://owasp.org/www-project-top-10-for-large-language-model-applications/, 2025.

[71]

B. H. Sigelman et al., “Dapper, a large-scale distributed systems tracing infrastructure,” Google, Inc., 2010. Available: https://research.google/pubs/dapper-a-large-scale-distributed-systems-tracing-infrastructure/

[72]

S. Kanzhelev, M. McLean, A. Reitbauer, B. Drutu, N. Molnar, and Y. Shkuro, “Trace context.” https://www.w3.org/TR/trace-context/, Nov. 2021.

[73]

Arize AI, “OpenInference semantic conventions.” https://github.com/Arize-ai/openinference/blob/main/spec/semantic_conventions.md, 2024.

[74]

C. Packer et al., “MemGPT: Towards LLMs as operating systems,” arXiv preprint arXiv:2310.08560, 2023. Available: https://arxiv.org/abs/2310.08560

[75]

W. Zhong, L. Guo, Q. Gao, H. Ye, and Y. Wang, “MemoryBank: Enhancing large language models with long-term memory,” arXiv preprint arXiv:2305.10250, 2023. Available: https://arxiv.org/abs/2305.10250

[76]

P. Chhikara, D. Khant, S. Aryan, T. Singh, and D. Yadav, “Mem0: Building production-ready AI agents with scalable long-term memory,” arXiv preprint arXiv:2504.19413, 2025. Available: https://arxiv.org/abs/2504.19413

[77]

D. Wu, H. Wang, W. Yu, Y. Zhang, K.-W. Chang, and D. Yu, “LongMemEval: Benchmarking chat assistants on long-term interactive memory,” in International conference on learning representations (ICLR), 2025. Available: https://arxiv.org/abs/2410.10813

[78]

A. Maharana, D.-H. Lee, S. Tulyakov, M. Bansal, F. Barbieri, and Y. Fang, “Evaluating very long-term conversational memory of LLM agents,” arXiv preprint arXiv:2402.17753, 2024. Available: https://arxiv.org/abs/2402.17753

[79]

H. Tan, Z. Zhang, C. Ma, X. Chen, Q. Dai, and Z. Dong, “MemBench: Towards more comprehensive evaluation on the memory of LLM-based agents,” in Findings of the association for computational linguistics: ACL 2025, 2025, pp. 19336–19352. Available: https://aclanthology.org/2025.findings-acl.989/

[80]

W. Xu, Z. Liang, K. Mei, H. Gao, J. Tan, and Y. Zhang, “A-MEM: Agentic memory for LLM agents,” in Advances in neural information processing systems (NeurIPS), 2025. Available: https://arxiv.org/abs/2502.12110

[81]

Y. Hu et al., “Memory in the age of AI agents,” arXiv preprint arXiv:2512.13564, 2025. Available: https://arxiv.org/abs/2512.13564