Autonomous Agents

Autonomous Agents-research papers. Updated daily. Resources-section-section.

Research papers: 2025

2025, 2024, 2023, Earlier

Chronological order.

1st August 2025

Agentic large language models improve retrieval-based radiology question answering

Agentic RAG (Agentic Retrieval-Augmented Generation): introduces a multi-agent framework for radiology question answering, enabling LLMs to autonomously decompose questions, iteratively retrieve clinical evidence, and dynamically synthesize responses.
This framework significantly improves diagnostic accuracy and reduces hallucinations, particularly for mid-sized and small-scale LLMs, by grounding responses in real-time, evidence-based information from Radiopaedia.org.
The agentic approach supports interpretable, evidence-grounded QA, demonstrating complementary roles of retrieval and fine-tuning, and provides human-interpretable context for expert radiologists.

Medical Reasoning in the Era of LLMs: A Systematic Review of Enhancement Techniques and Applications

Medical Reasoning LLMs Enhancement Taxonomy: introduces a systematic review of techniques to enhance medical reasoning in LLMs, categorizing them into training-time and test-time strategies.
The review analyzes how these techniques are applied across different data modalities (text, image, code) and in key clinical applications like diagnosis, education, and treatment planning.
It also surveys evaluation benchmarks, identifies challenges like faithfulness-plausibility gap and the need for native multimodal reasoning, and outlines future research directions.

Multi-Agent Game Generation and Evaluation via Audio-Visual Recordings

AVR-Agent (Multi-agent framework for audio-visual content generation): introduces a multi-agent system for generating interactive multimedia content, featuring a Text + Code Agent, an Omni-modal Agent, and the AVR-Eval metric.
The framework leverages Audio-Visual Recordings (AVRs) and console logs to iteratively refine JavaScript game and animation code, selecting the best initial content from multiple candidates.
AVR-Agent aims to automate game design by integrating asset selection, code generation, and an automated evaluation loop, demonstrating improved content quality over one-shot generation.

ContestTrade: A Multi-Agent Trading System Based on Internal Contest Mechanism

ContestTrade (A Multi-Agent Trading System Based on Internal Contest Mechanism): introduces a novel multi-agent trading system with a dual-stage pipeline, including a Data Team for factor generation and a Research Team for signal generation, both leveraging internal contest mechanisms for continuous self-optimization.
The system processes multi-source market data through specialized Data Analysis Agents, which generate context-friendly textual factors, and then passes these to Research Agents that utilize deep research methods and financial tools to produce actionable trading signals.
Its core innovation lies in the real-time evaluation and ranking within each team, ensuring only optimal outputs are adopted, thereby enhancing robustness against market noise and delivering superior trading performance.

Pro2Guard: Proactive Runtime Enforcement of LLM Agent Safety via Probabilistic Model Checking

Pro2Guard (Proactive Runtime Enforcement of LLM Agent Safety via Probabilistic Model Checking): introduces a proactive runtime enforcement framework that models LLM agent behavior as Discrete-Time Markov Chains (DTMCs) over symbolic abstractions, comprising Offline Sample, Domain-specific Abstraction, Learn DTMC, and Runtime Enforcement.
The framework anticipates future risks by estimating the probability of reaching unsafe states, triggering interventions before violations occur when predicted risk exceeds a user-defined threshold.
It ensures statistical reliability through semantic validity checks and PAC bounds, generalizing across domains like embodied household agents and autonomous vehicles.

CyGATE: Game-Theoretic Cyber Attack-Defense Engine for Patch Strategy Optimization

CyGATE (Game-Theoretic Cyber Attack-Defense Engine for Patch Strategy Optimization): introduces a game-theoretic framework that integrates LLMs with RAG to enhance patch strategy optimization in dynamic cybersecurity environments, featuring an Input Layer (gathers threat intelligence), Graph-Based Analysis (processes system topology), Knowledge Base (stores structured data), Embed (converts data to vectors), Vector DB (stores vector embeddings), RAG Segmentation (retrieves relevant threat data), Process Layer (simulates agent interactions), Attack Planner (LLM-augmented attacker agent), Defend Analyst (LLM-augmented defender agent), POSG Simulation (models cyber conflicts), Belief Status (agents track uncertainty), Payoff Functions (quantify financial outcomes), Output Layer (produces actionable insights), and Feedback Loop (updates knowledge base).
The framework models attacker-defender interactions as a Partially Observable Stochastic Game (POSG) across Cyber Kill Chain stages, enabling agents to adapt tactics and prioritize patches based on evolving risks and observed adversary behavior.
It leverages LLM-augmented RAG pipelines to continuously retrieve and incorporate contextualized threat signals, enhancing adaptability to novel TTPs and evolving attack campaigns.

ReaGAN: Node-as-Agent-Reasoning Graph Agentic Network

ReaGAN (Retrieval-augmented Graph Agentic Network): introduces an agent-based framework that models each graph node as an autonomous agent, equipped with Memory, Planning, Tools, and Action components, enabling individualized decision-making and adaptive message propagation.
Each node leverages a frozen LLM for in-context planning and utilizes Retrieval-Augmented Generation (RAG) as a tool to access global semantic information from the graph, which is treated as a searchable database.
This approach allows nodes to dynamically integrate both local structural and global semantic context, addressing limitations of traditional GNNs in handling information imbalance and long-range dependencies.

PilotRL: Training Language Model Agents via Global Planning-Guided Progressive Reinforcement Learning

PilotRL (Global Planning-Guided Progressive Reinforcement Learning): introduces an adaptive global plan-based agent paradigm, AdaPlan, which synergizes high-level explicit guidance from a Global Planner with action execution by an Executor, interacting within an Environment.
The framework employs a three-stage progressive reinforcement learning process, including Executor Enhancement, Global Planner Cultivation, and Joint Optimization, to improve agent capabilities and coordination.
PilotRL utilizes Group Relative Policy Optimization (GRPO) as its learning algorithm, featuring a Policy Model, Reference Model, and Group Computation to drive agent learning and performance.

Mind the Gap: The Divergence Between Human and LLM-Generated Tasks

Conceptual Framework: introduces a comparative study investigating the divergence between human and LLM-generated tasks, featuring Human (task generator) driven by Values (personal motivations) and Embodied Experience (physical/social grounding) to produce Human Goals (generated task), contrasted with LLM (task generator) driven by Different Prompts (input conditions) to produce LLM Goals (generated task).
The study finds that human task generation is systematically influenced by psychological drivers and embodied experience, whereas LLMs fail to replicate these patterns, producing tasks that are less social and physical.
This research highlights a core gap between value-driven, embodied human cognition and the statistical patterns of LLMs, emphasizing the necessity of incorporating intrinsic motivation and physical grounding into future agent design.

Calibrated Language Models and How to Find Them with Label Smoothing

Efficient Smoothed Cross-Entropy Computation: introduces a novel method for applying label smoothing to LLMs, featuring a custom computational kernel that optimizes the cross-entropy loss calculation by leveraging block-wise processing, on-chip shared memory, and a lock mechanism for efficient forward and backward passes.
This approach addresses significant calibration degradation in instruction-tuned LLMs, particularly those with large vocabularies, by maintaining calibration throughout the supervised fine-tuning process.
The custom kernel dramatically reduces memory consumption for cross-entropy loss computation with label smoothing, without sacrificing speed or performance compared to existing solutions.

Quality-of-Service Aware LLM Routing for Edge Computing with Multiple Experts

QoS-aware LLM Router: introduces a deep reinforcement learning (DRL)-based framework for routing LLM user requests to heterogeneous edge experts, aiming to maximize long-term Quality-of-Service (QoS) by considering dynamic workloads and resource heterogeneity.
The framework incorporates a Dynamic State Abstraction technique using a Heterogeneous Graph Attention Network (HAN) to compactly represent global state features and an Action Impact Estimator with a tailored reward function to guide the DRL agent.
This approach addresses challenges of LLM service heterogeneity, request interference, and dynamic workloads, ensuring sustained high-quality LLM services and preventing latency violations in edge computing environments.

How Far Are AI Scientists from Changing the World?

Capability-Level Framework for AI Scientist Development: introduces a staged roadmap for AI Scientist systems, with all components, where it systematically defines the stages of AI scientist development from foundational knowledge acquisition to continuous evolution.
The paper comprehensively analyzes current achievements of AI Scientist systems, identifying key bottlenecks and critical components required for the emergence of a scientific agent.
This survey contributes to understanding limitations of current AI Scientist systems, outlining what is missing, and defining ultimate goals for scientific AI.

Edge Agentic AI Framework for Autonomous Network Optimisation in O-RAN

Edge Agentic AI Framework: introduces an autonomous network optimization solution for O-RAN environments, integrating a persona-based multi-tools architecture, proactive anomaly detection via a traffic predictive tool, and a safety-aligned reward mechanism.
The framework, embedded within the RIC as an xApp, leverages an LLM, various tools, and memory, operating through a ReAct framework to monitor and control networks in real-time.
It achieves zero network outages under high-stress conditions by anticipating and responding to dynamic network conditions, ensuring near real-time responsiveness and consistent QoS.

A SURVEY OF SELF-EVOLVING AGENTS: ON PATH TO ARTIFICIAL SUPER INTELLIGENCE

Self-evolving Agents: introduces a comprehensive survey of self-evolving agents, organized around four fundamental architectural components: Models (underlying LLM/MLLM), Context (information shaping agent behavior), Tools (capabilities for external interaction), and Agentic Architecture (control flow, collaborative structures).
The paper details how these agents continuously learn and adapt from real-world feedback, aiming to overcome the static nature of traditional LLMs and pave the way for Artificial Super Intelligence (ASI).
It provides a structured framework for understanding and designing adaptive, robust, and versatile agentic systems, covering what, when, and how agents evolve, along with evaluation metrics and future directions.

31st July 2025

A Survey on Code Generation with LLM-based Agents

LLM-based Agents: introduces a systematic survey of LLM-based code generation agents, detailing their core architectural components including planning, memory, tool usage, and reflection, and exploring multi-agent system enhancements like workflow management, context management, and collaborative optimization.
The survey categorizes core techniques, applications across the software development lifecycle, evaluation benchmarks, and representative tools, while also identifying challenges and future research directions.
It highlights the evolution of these agents from simple text generation to autonomous systems capable of managing complex software development tasks.

PHYSICSEVAL: Inference-Time Techniques to Improve the Reasoning Proficiency of Large Language Models on Physics Problems

Multi-Agent Review Framework: introduces a system for improving LLM reasoning proficiency on physics problems, including a Proposer Module (generates initial solutions), Verifier Module (assesses solution quality), and Meta-Verifier Module (filters, aggregates feedback).
This framework processes a Problem (input physics question) to produce a Proposed Solution (initial LLM output), which is then reviewed by multiple verifiers, leading to Aggregated Feedback (refined mistake list, score) that informs the final Solution (final refined answer).
The framework aims to reduce computational overhead by delegating verification to smaller LLM agents and provides an unbiased assessment by comparing mistakes across multiple verifiers.

SIMURA: Towards General Goal-Oriented Agent via Simulative Reasoning Architecture with LLM-Based World Model

SIMURA (Simulative Reasoning Architecture): introduces a goal-oriented architecture for generalized agentic reasoning, featuring an Encoder (observation summarizer), Belief State (internal world representation), Planner (action sequence generator) with Policy (action proposer), World Model (outcome simulator), Critic (outcome evaluator), Simulated Action (high-level planning action), Actor (concrete action executor), Action (executable low-level command), and Memory (past interaction storage).
The architecture overcomes autoregressive LLM limitations by using an LLM-based world model for planning via simulation, enabling flexible planning in diverse environments.
SIMURA employs a hierarchical design that separates perception, simulative planning, and action selection, enhancing adaptability and consistency across various tasks.

TEXTQUESTS: HOW GOOD ARE LLMS AT TEXT-BASED VIDEO GAMES?

TEXTQUESTS: introduces a benchmark for evaluating LLM agents in complex, interactive text-based video games, featuring Infocom Interactive Fiction Games, an LLM Agent interacting with the Environment via System Prompt, Observations, Reasoning, and Actions, supported by a Context History, optional Clues (InvisiClues), an Autosave Mechanism, and evaluated using Game Progress and Harm Metrics.
This benchmark is designed to assess an LLM agent's self-contained problem-solving capacity by precluding external tools, focusing on intrinsic long-context reasoning and trial-and-error learning within a single interactive session.
The framework's enhancements, including clue-assisted evaluation, autosave/restore, and a checkpoint-based game progress metric, aim to provide a more accurate and direct assessment of LLMs as the reasoning backbone of AI agent systems.

A survey of multi-agent geosimulation methodologies: from ABM to LLM.

ARM (Agent Reference Model): introduces a formal specification for geosimulation platforms, integrating LLMs as agent components for perception, memory, planning, and action.
The ARM defines agent internal state structures (beliefs, goals, intentions, preferences, commitments, plans, history), internal dynamics (updating, activation, planning/execution mechanisms), external state (roles, use cases), and interface (skills, abilities, capabilities).
This framework provides a structured architecture for next-generation geosimulation systems, enabling LLMs to effectively contribute to fundamental agent activities and interactions within complex geographical simulations.

CFDagent: A Language-Guided, Zero-Shot Multi-Agent System for Complex Flow Simulation

CFDagent: introduces a zero-shot, language-guided multi-agent system for autonomous computational fluid dynamics (CFD) simulations, integrating a Preprocessing Agent, Solver Agent, and Postprocessing Agent, all guided by GPT-4o, to handle geometry generation, flow solving, and results visualization.
The system leverages Point-E for 3D geometry generation from text or images and an Immersed Boundary (IB) flow solver for accurate fluid dynamics simulations.
CFDagent enables end-to-end CFD workflows from natural language prompts, significantly lowering barriers to expert-level CFD by automating complex tasks and providing multimodal output.

TWEAKLLM: A ROUTING ARCHITECTURE FOR DYNAMIC TAILORING OF CACHED RESPONSES

TWEAKLLM (A Routing Architecture for Dynamic Tailoring of Cached Responses): introduces a novel routing architecture that dynamically adapts cached LLM responses to new prompts, utilizing a Query Preprocessing, Embedding Model, Vector Database, Cache Management, Cosine Similarity, Similarity Threshold, Small LLM, and Big LLM.
This two-tier system optimizes response quality, latency, and computational cost by leveraging a lightweight LLM to refine cached responses for similar queries, reducing reliance on a more expensive LLM.
The architecture significantly improves cache effectiveness and reduces inference costs while maintaining response quality comparable to frontier models, addressing limitations of traditional semantic caching.

MemoCue: Empowering LLM-Based Agents for Human Memory Recall via Strategy-Guided Querying

Recall Router: introduces MemoCue (an LLM-based agent for human memory recall), with 5W Recall Map (classifies queries), Recall Strategy Pool (stores strategies), SGR-MCTS (optimizes strategy selection), MemoStrategy Dataset (tunes LLMs), and LLMs (generate cues), where the paper proposes a novel strategy-guided method to transform original queries into cue-rich ones for memory recall.
The framework leverages a hierarchical recall tree and Monte Carlo Tree Search to optimize strategy selection and response generation, incorporating a fine-grained reward mechanism based on simulated user feedback.
MemoCue, developed through instruction tuning, demonstrates superior performance in recall inspiration compared to traditional LLM-based methods, addressing challenges of limited memory data and effective cue generation.

DICE: Dynamic In-Context Example Selection in LLM Agents via Efficient Knowledge Transfer

DICE (Dynamic In-Context Example Selection): introduces a theoretically grounded in-context learning framework for LLM agents, which includes an Agent (LLM-based decision-maker), a Demo Pool (stores demonstration trajectories), a Knowledge Retriever (extracts transferable knowledge), and a Selection Mechanism (dynamically selects demonstrations) to enhance performance by maximizing transferable knowledge at each reasoning step.
This framework addresses the sensitivity of in-context learning to demonstration choice by mitigating spurious dependencies through a causal lens, ensuring only relevant knowledge is transferred.
Operating as a training-free, plug-in module, it consistently improves agent performance across diverse domains and existing agentic frameworks without additional training costs.

Chatting with your ERP: A Recipe

REACT-Based Text2SQL Architecture: introduces an LLM agent that chats with an industrial ERP system by interpreting natural language queries and translating them into executable SQL statements, leveraging open-weight LLMs, with a novel dual-agent architecture combining reasoning and critique stages.
The system's REACT Agent interprets user intent and delegates to the SQL Agent, which transforms natural language into optimized SQL queries through a collaborative loop between a SQL Reasoner and a SQL Critic.
The architecture enhances reliability by incorporating a Database Schema for context-aware SQL generation, a Human-in-the-Loop mechanism for user intent clarification, and a Reasoned Structured Outputs pipeline for robust LLM integration.

Trae Agent: An LLM-based Agent for Software Engineering with Test-time Scaling

Trae Agent: introduces an LLM-based agent for software engineering with test-time scaling, which addresses large ensemble spaces and repository-level understanding through modular agents for generation, pruning, and selection.
The framework enhances LLM-based issue resolution by generating diverse candidate patches, eliminating redundant or faulty ones, and accurately selecting the most plausible solution.
It achieves superior performance on the SWE-bench benchmark, demonstrating robust effectiveness and scalability for complex software engineering tasks.

SWE-Exp: Experience-Driven Software Issue Resolution

SWE-Exp (Experience-Driven Software Issue Resolution): introduces an experience-enhanced approach that transforms software issue resolution from isolated problem-solving into a continuous learning process, with Trajectories Collection, Experiences Extraction, ExpAgent, Experience Bank, Issue Type, Description, Comprehension Experiences, Modification Experiences, Embedding Model, Experience Reuse, Experience Retrieval, Rerank Agent, Dual-Agent Architecture, Instructor Agent, Assistant Agent, and Monte Carlo Tree Search, where it distills concise and actionable experience from prior agent trajectories to guide future repair attempts.
The framework maintains an evolving multi-faceted Experience Bank that captures successful and failed repair attempts, encoding knowledge across trajectory-guided problem understanding, fault localization patterns, and modification strategies.
The approach employs a dual-agent architecture, where an Instructor agent formulates high-level strategies and an Assistant agent executes low-level operations, leveraging accumulated knowledge to avoid redundant exploration and improve patch quality.

SWE-Debate: Competitive Multi-Agent Debate for Software Issue Resolution

SWE-Debate: introduces a competitive multi-agent debate framework for software issue resolution, with Issue Description, Dependency Graph Construction, Entry Node Identification, Fault Propagation Trace Generation, Localization Chain Selection, Specialized Agents, Modification Plan Proposal, Competitive Strategy Refinement, Discriminator Agent, Monte Carlo Tree Search (MCTS), Environment, Editor, and Patch Generation, designed to promote diverse reasoning paths and achieve consolidated issue localization for automated repository-level issue resolution.
The framework operates through a three-stage pipeline: generating multiple fault propagation traces, organizing a three-round debate among specialized agents, and integrating the consolidated fix plan into an MCTS-based code modification agent for patch generation.
This approach addresses limitations of independent agent exploration by leveraging competitive multi-agent reasoning and graph-based dependency analysis to improve fault localization accuracy and issue resolution rates.

DSBC : Data Science task Benchmarking with Context engineering

DSBC (Data Science task Benchmarking with Context engineering): introduces a comprehensive benchmark for data science agents, evaluating LLMs across various tasks and prompting methodologies, including Context Engineering, Single-step, Multi-step, and SmolAgent approaches, with evaluation performed by a VLM-as-a-Judge.
The benchmark is designed to reflect real-world user interactions and assess LLM sensitivity to common prompting issues like data leakage and ambiguous instructions, utilizing diverse data science task categories.
The research investigates the impact of temperature parameters on LLM performance and identifies critical factors for practical deployment of data science agents.

DynaSwarm: Dynamically Graph Structure Selection for LLM-based Multi-agent System

DynaSwarm: introduces a dynamic framework for LLM-based multi-agent systems, enhancing adaptability and accuracy by dynamically selecting optimal graph structures per query, with components including Swarm Structure Initialization (initializes agent graph), Reinforcement Learning Scheme (optimizes graph structures), Learned Swarm Structures (candidate graph topologies), Graph Selector (selects optimal graph), LLM Backbone (underlying language model), LoRA Modules (adapts LLM for selection), Pooler (aggregates hidden states), Linear Prediction Module (outputs selection score), LLM Agents (perform specific operations), Nodes (represent inference procedures), BranchingStep (creates multiple paths), GreedySteps (executes sequential steps), Reflection (refines previous outputs), ReturnAll (aggregates results), and Edges (define communication order).
The framework unifies a novel reinforcement learning scheme for discovering inter-agent connection patterns with a lightweight, sample-aware controller for fine-tuning LLMs to select ideal graph topologies.
It consistently outperforms state-of-the-art single-agent models and existing multi-agent systems across various LLM backbones and tasks, demonstrating the pivotal role of per-input structural flexibility.

Enabling Few-Shot Alzheimer's Disease Diagnosis on Tabular Biomarker Data with LLMs

TAP-GPT (Tabular Alzheimer's Prediction GPT): introduces a novel framework for few-shot Alzheimer's Disease diagnosis on tabular biomarker data, utilizing TableGPT2 (a multimodal tabular-specialized LLM) with its semantic table encoder and QWen2.5 LLM decoder, adapted via few-shot tabular prompts and qLoRA finetuning.
This framework repurposes TableGPT2, originally designed for business intelligence, to classify AD versus cognitively normal individuals from biomarker tables, demonstrating effective performance with limited sample sizes.
TAP-GPT provides interpretability through generated natural language rationales for its predictions, which is crucial for clinical settings and supports the development of future LLM-driven multi-agent systems in biomedical informatics.

GEAK: INTRODUCING TRITON KERNEL AI AGENT & EVALUATION BENCHMARKS

GEAK (Generating Efficient AI-centric GPU Kernels): introduces an agentic framework for automatic Triton kernel generation, leveraging LLMs within an Iterative Scaling loop that includes a Generator (LLM-based code producer), an Evaluator (tests code correctness/performance), a Reflector (LLM-based error analysis), and an Optimizer (LLM-based performance enhancement), further enhanced by Parallel Scaling.
The framework iteratively refines generated Triton GPU kernels for AMD Instinct™ GPUs, aiming to achieve near-expert performance and reduce manual optimization efforts.
GEAK significantly outperforms direct LLM prompting and Reflexion-based pipelines in correctness and execution speed on TritonBench-revised and ROCm Triton benchmarks.

GenoMAS: A Multi-Agent Framework for Scientific Discovery via Code-Driven Gene Expression Analysis

GenoMAS (Genomic data analysis through LLM-based Multi-Agent System): introduces a multi-agent framework for scientific discovery via code-driven gene expression analysis, orchestrating six specialized LLM agents through a guided-planning framework and typed message-passing protocols.
The framework reframes scientific agents as collaborative programmers that generate, revise, and validate executable code, bridging the gap between general reasoning and precision-driven scientific computation.
It achieves state-of-the-art performance on gene expression analysis benchmarks by balancing structured workflows with autonomous adaptation, robust error handling, and efficient code reuse mechanisms.

30th July 2025

Uncovering the Fragility of Trustworthy LLMs through Chinese Textual Ambiguity

CTAE-M (Chinese Textual Ambiguity Evaluation Methodology): introduces a comprehensive evaluation framework to assess the trustworthiness and fragility of LLMs when encountering Chinese textual ambiguity, utilizing a new benchmark dataset, various prompting strategies, and specific evaluation tasks.
The methodology employs a human-annotated benchmark of 900 ambiguous Chinese sentences categorized into lexical, syntactic, and semantic-pragmatic types, each with multiple interpretations and disambiguated pairs.
It systematically evaluates LLMs across ambiguity detection, understanding, and end-to-end tasks, analyzing their overconfidence, overthinking, and inability to reliably distinguish ambiguous from unambiguous text.

ChatVis: Assisting and Evaluating Large Language Models for Generating Scientific Visualizations

ChatVis: introduces an LLM assistant for generating scientific visualizations, with User Prompts, LLM (Operations), ParaView Code base, ParaView Documentation, Embedding Models, Vector DB, Retrieved Context, LLM (Generation), Code Solution, Code Correction (Loop), and Visualization, designed to aid LLMs in generating Python code for ParaView scientific visualization tasks without retraining.
The framework employs chain-of-thought prompt simplification, retrieval-augmented prompt generation using a vector database of documentation and code examples, and iterative error checking.
It significantly improves performance across various metrics compared to unassisted LLMs, demonstrating enhanced accuracy and reliability in generating visualization scripts.

Beyond Rigid AI: Towards Natural Human-Machine Symbiosis for Interoperative Surgical Assistance

Perception Agent: introduces an AI-driven system for real-time, on-demand segmentation of known and novel surgical elements, integrating Speech-incorporated LLMs (interprets natural language), Memory Repository (stores element memory), CoTracker3 (tracks video points), SAM2 (generates segmentation masks), Object-Centric Segmentation Mechanism (identifies novel instruments by motion), and Reference-Based Segmentation Mechanism (segments novel elements using reference).
The system facilitates natural human-machine interaction through speech-based input and hands-free, motion-based prompting for segmenting novel elements.
It enhances surgical assistance by overcoming the rigidity of traditional AI solutions, enabling continuous learning and adaptation to dynamic surgical environments.

SCREENCODER: ADVANCING VISUAL-TO-CODE GENERATION FOR FRONT-END AUTOMATION VIA MODULAR MULTIMODAL AGENTS

ScreenCoder (Modular Multi-Agent Framework): introduces a modular multi-agent framework for UI-to-code generation, with Input (UI screenshots/sketches), Grounding Agent (detects/labels UI components), Planning Agent (constructs hierarchical UI layout), Generation Agent (synthesizes HTML/CSS code), Output (generated webpage/code), and a Scalable Data Engine (generates UI-code training data), which decomposes the task into interpretable stages for robust front-end automation.
This framework leverages a Vision-Language Model for component grounding, applies front-end engineering priors for hierarchical layout planning, and uses adaptive prompt-based synthesis for HTML/CSS code generation, including a placeholder mapping strategy for image restoration.
The framework also functions as a scalable data engine, automatically producing large-scale image-code pairs to fine-tune and reinforce open-source LLMs, achieving state-of-the-art performance in layout accuracy, structural coherence, and code correctness.

The Multi-Agent Fault Localization System Based on Monte Carlo Tree Search Approach

KnowledgeMind: introduces a multi-agent LLM system for fault localization, with Anomaly Alarm Agent, Alarm Graph Agent, Fault Mining Tree, Monte Carlo Tree Search (MCTS), Metric Agent, Trace Agent, Log Agent, Verifier Agent, Knowledge Base Agent, Service-Pod Agent, and various Tools, where it leverages MCTS and a knowledge base reward mechanism for service-by-service reasoning to identify root causes in microservice systems.
The framework standardizes the reasoning process by constructing a Fault Mining Tree and utilizing rule-based rewards to mitigate LLM hallucinations and reduce context window length requirements.
It integrates specialized agents for metrics, traces, and logs, enhancing diagnostic capabilities and improving root cause localization accuracy compared to existing LLM-based RCA methods.

MASCA: LLM based-Multi Agents System for Credit Assessment

MASCA (LLM based-Multi Agents System for Credit Assessment): introduces an LLM-driven multi-agent system for credit assessment, featuring a layered architecture with specialized LLM-based agents for data ingestion, contextualization, multidimensional assessment, and strategic optimization.
The framework's hierarchical structure, inspired by Signaling Game Theory, decomposes complex credit assessment into sub-tasks handled by collaborative agents, enhancing accuracy, fairness, and adaptability.
It integrates contrastive learning for risk and reward assessment to optimize decision-making, providing a robust and explainable system for financial applications, particularly credit scoring.

OFCNETLLM: LARGE LANGUAGE MODEL FOR NETWORK MONITORING AND ALERTNESS

OFCNETLLM (Large Language Model for Network Monitoring and Alertness): introduces a multi-agent LLM-based framework for network monitoring, enhancing anomaly detection, root-cause analysis, and incident analysis, with Monitoring (processes network data), Summary Agent (summarizes network data), Error Prediction Agent (predicts network errors), Sentiment Analysis Agent (analyzes network sentiment), Traffic Extrapolation Agent (extrapolates network traffic), Reporting Agent (generates network reports), Database Tools (manages monitoring databases), LLAMA (open-source LLM model), LangChain (LLM agent framework), ML Training Tools (machine learning training), Identification (classifies data segments), Solution (analyzes data protocols), and Report to Host/Operator (delivers network insights).
The framework leverages specialized LLM-based agents and integrated tools to process network data, identify patterns, and manage monitoring databases.
It employs a multi-stage reasoning process to systematically identify network problems, analyze large datasets, and generate actionable reports for efficient network management.

MetaAgent: Automatically Constructing Multi-Agent Systems Based on Finite State Machines

MetaAgent: introduces an FSM-based framework for automatically generating and optimizing multi-agent systems, featuring a Designer LLM, Finite State Machine with States and Transitions, Task-Solving Agents, Condition Verifiers, Listener Agents, an Adaptor LLM for optimization, and integrated Tools.
This framework designs agents and organizes them into an FSM, where states define sub-tasks, agents execute actions, condition verifiers manage transitions, and listener agents receive outputs, enabling dynamic problem-solving and state traceback.
The system optimizes the FSM by merging redundant states using an Adaptor LLM, enhancing robustness and performance without external training data, and supports tool-using for real-world interaction.

Strategic Communication and Language Bias in Multi-Agent LLM Coordination

FAIRGAME (computational framework): introduces a system for simulating strategic interactions among LLM-based agents, including a communication layer, configuration file, prompt template, and various game scenarios, utilizing LLMs like GPT-4o and Llama 4 Maverick.
This framework enables controlled experimentation across different models, languages, and behavioral setups to investigate how explicit communication influences collective behavior and biases.
The study extends FAIRGAME to support inter-agent dialogue, allowing for systematic comparison of interactive and non-interactive conditions in game-theoretic environments.

GIT CONTEXT CONTROLLER: MANAGE THE CONTEXT OF LLM-BASED AGENTS LIKE GIT

GCC (Git-Context-Controller): introduces a structured context management framework for LLM-based agents, with Git-inspired operations and a version-controlled file system, including a persistent file system, .GCC/ directory, main.md, branches/ directory, / directory, commit.md, log.md, metadata.yaml, and callable commands like COMMIT, BRANCH, MERGE, and CONTEXT, enabling agents to manage long-horizon goals and structured reflection.
The framework elevates agent context from passive token streams to a navigable, versioned memory hierarchy, supporting multi-level context retrieval and isolated exploration via branching.
Equipped LLM-based agents with GCC achieve state-of-the-art performance on SWE-Bench-Lite, demonstrating improved task resolution and the emergence of recursive self-improvement in a self-replication case study.

AutoCodeSherpa: Symbolic Explanations in AI Coding Agents

AutoCodeSherpa: introduces a framework for symbolic bug explanations, generating input, infection, and output conditions using LLM agents and program analysis tools, including PBT-generating agent, Code exploration agent, Infection condition generating agent, PBT execution and manipulation tools, Command line and file reading tools, Condition injection and test execution tools, Input condition, Infection conditions, Output condition, Buggy program, and Fixed program.
This multi-agent system helps developers understand bugs, assess patch correctness, and improve other AI agents' effectiveness by providing executable explanations.
The framework's symbolic explanations, derived from natural language issue descriptions, capture the bug's trigger, propagation, and symptoms, enhancing trust in AI-generated fixes.

Mitigating Response Delays in Free-Form Conversations with LLM-powered Intelligent Virtual Agents

The System Architecture: introduces a pipeline for LLM-powered intelligent virtual agents in VR, integrating Unity (VR application environment), Microphone Listener (captures user speech), Audio Player (plays agent voice), OVR Lip Sync (animates agent mouth), ASR Model (transcribes user speech), Conversation Handler (manages dialogue flow), Message History (stores conversation context), Transition Check (identifies task transitions), LLM (generates agent responses), and TTS API (converts text to voice).
This system investigates the impact of response delays and conversational fillers on user perception and experience in free-form conversations within virtual reality.
The research demonstrates that natural conversational fillers improve perceived response time, especially in high-delay conditions, and provides an open-source pipeline for deploying such agents.

An Explainable Emotion Alignment Framework for LLM-Empowered Agent in Metaverse Service Ecosystem

Explainable Emotion Alignment Framework: introduces an LLM-empowered agent framework that integrates factual factors into decision-making, enabling agents to achieve more relational fact alignment through emotional data clustering, evolution, self-explanation, and knowledge storage.
The framework enhances LLMs' comprehension of knowledge-emotion dependencies and establishes an emotional evolution system for more human-like decisions and behaviors in social simulation.
Simulation experiments in an Offline-to-Offline food delivery scenario validate the framework's effectiveness in achieving more realistic social emergence and lower order rejection rates.

DeepSieve: Information Sieving via LLM-as-a-Knowledge-Router

DeepSieve (Information Sieving via LLM-as-a-Knowledge-Router): introduces a novel RAG method that incorporates information sieving via an LLM-as-a-knowledge-router, which dynamically decomposes queries, routes sub-questions to heterogeneous knowledge sources, and iteratively refines answers through a multi-stage process.
This modular and transparent approach addresses the limitations of traditional RAG pipelines by providing fine-grained control over query and source sides, enhancing reasoning depth and retrieval precision.
The framework demonstrates superior performance across multi-hop QA benchmarks with heterogeneous sources, achieving higher accuracy and token efficiency compared to existing RAG and agentic baselines.

Can large language models assist choice modelling? Insights into prompting strategies and current models capabilities

Experimental Framework: introduces a systematic evaluation of LLMs as assistive agents in discrete choice modeling, utilizing Large Language Models (LLMs), Input Data, Prompting Strategies, Information Settings, Modelling Goals, Generated MNL Specifications, Self-Generated Code, External Estimation, and Evaluation Metrics to assess their capabilities in model specification and estimation.
The framework benchmarks thirteen LLM versions across five experimental configurations, varying prompting strategies (Zero-Shot vs. Chain-of-Thought), information availability (full dataset vs. data dictionary), and modeling goals (suggesting vs. suggesting and estimating Multinomial Logit models).
Findings indicate that structured prompts and limited raw data access can enhance LLM performance in generating plausible specifications, with GPT-03 uniquely capable of end-to-end estimation via self-generated code, while open-weight LLMs generally underperformed.

29th July 2025

CoEx – Co-evolving World-model and Exploration

CoEx (Co-evolving World-model and Exploration): introduces a hierarchical agent architecture that enables LLM planning to co-evolve with a dynamically updated world model, featuring a Planner (generates abstract subgoals), an Actor (executes subgoals, low-level actions), and an Adaptive Belief State (adaptable world model representation) comprising Symbolic Memory (code-based, object-oriented facts), Structured Textual Memory (natural language, higher-level understanding), and a Verification and Synthesis Module (updates belief state).
This framework addresses exploitation bias and limited adaptation in monolithic LLM agents by decoupling planning and exploration at the subgoal level and integrating new observations into a persistent, explicit world model.
The agent demonstrates superior performance in planning and exploration across diverse text-based environments like ALFWorld, PDDL, and Jericho by leveraging its neurosymbolic belief state and dynamic replanning capabilities.

Promoting Online Safety by Simulating Unsafe Conversations with LLMs

Simulating Scam Conversations to Increase Resilience: introduces a system that promotes online safety by simulating unsafe conversations between a scammer LLM and a target LLM, where users provide feedback to the target LLM.
This system leverages distinct LLM personalities, configured via prompt engineering, to create realistic scam scenarios for user interaction and learning.
The approach aims to help users develop mental models and resilience against online scams by actively engaging them in preventing the target LLM from divulging sensitive information.

CTG-Insight: A Multi-Agent Interpretable LLM Framework for Cardiotocography Analysis and Classification

CTG-Insight: introduces a multi-agent LLM framework for cardiotocography analysis and classification, with CTG Trace (fetal monitoring data input), Feature Agents (parallel feature analysis), and Aggregator Agent (holistic classification and explanation).
The framework decomposes CTG interpretation into five medically defined features—baseline, variability, accelerations, decelerations, and sinusoidal pattern—each analyzed by a dedicated LLM agent.
An aggregation LLM agent then synthesizes these individual feature analyses to provide a comprehensive fetal health classification with natural language explanations, mirroring clinical reasoning.

Validating Generative Agent-Based Models of Social Norm Enforcement: From Replication to Novel Predictions

LLM Agent Architecture: introduces a systematic two-stage validation approach for generative agent-based models (GABM) of social norm enforcement, which includes an Observation Summary (processes game information), Situation Assessment (evaluates decision context), Decision (generates agent action), Persona Component (models individual differences), Theory of Mind Component (reasons about others), Strategic Reflection Component (optimizes long-term payoff), and Emotion Reflection Component (models emotional responses).
The paper validates these LLM agent architectures by replicating known human behaviors in social dilemma paradigms, such as the Trust Game and Public Goods Game, and then uses the validated models to simulate novel conditions and generate predictions.
This framework enables systematic hypothesis testing about which cognitive mechanisms are necessary for reproducing human social behavior, providing a rigorous method for evaluating generative agent models and advancing understanding of social dynamics.

UserBench: An Interactive Gym Environment for User-Centric Agents

UserBench: introduces a user-centric gym environment designed to evaluate LLM agents in multi-turn, preference-driven interactions, with all its components including Data Gathering, Tool Augmentation, Environment, and Interface, where it simulates realistic user communication traits like underspecification, incrementality, and indirectness in travel planning scenarios.
The environment features a standardized interaction interface and a stable tool-use backend, enabling rigorous and reproducible evaluation of agent performance in understanding and aligning with user intent.
The framework provides a scalable and modular setup for benchmarking and training LLM agents to become collaborative partners rather than just task executors.

Exploring the Stratified Space Structure of an RL Game with the Volume Growth Transform

PPO-TransformerXL: introduces a framework to explore the geometric structure of the embedding space of a transformer model trained for reinforcement learning, utilizing a Visual Encoder (processes raw visual observations), Token Embedding Layer (converts CNN output to token embeddings), Transformer-XL Blocks (processes sequential token embeddings, leveraging memory), Value Head (predicts state values for PPO), Policy Head (outputs action probabilities for PPO), PPO Algorithm (optimizes policy and value functions), and Memory Window (manages recurrent state for Transformer-XL).
The paper investigates how a transformer-based PPO model embeds visual inputs from a "Searing Spotlights" RL game, finding that the token embedding space is better modeled as a stratified space with varying local dimensions rather than a manifold.
This research adapts the Volume Growth Transform from LLM analysis to the RL setting, suggesting that the distribution of dimensions in a stratified latent space can serve as a new geometric indicator of complexity for RL games.

Towards Cognitive Synergy in LLM-Based Multi-Agent Systems: Integrating Theory of Mind and Critical Evaluation

CSF (Cognitive Synergy Framework): introduces a multi-agent system framework that integrates dynamic Theory of Mind (ToM) and structured critical evaluation to enhance collaborative reasoning in LLM-based systems, featuring an Orchestrator, Specialized Agents, a Critic Agent, an Integrator, and a Knowledge Base.
This framework aims to achieve cognitive synergy by enabling agents to model others' perspectives and systematically critique arguments, leading to more coherent, adaptive, and rigorous interactions.
The system leverages LLMs for agent intelligence and external tools like Neo4j and Clingo for knowledge management and logical reasoning, demonstrating improved argument quality and risk resolution in complex decision-making scenarios.

MapAgent: Trajectory-Constructed Memory-Augmented Planning for Mobile Task Automation

MapAgent: introduces a novel LLM-based agent framework that leverages memory constructed from historical trajectories to augment current task planning, with a Trajectory-based Memory Mechanism (condenses historical trajectories), Page-Memory Database (structured long-term memory), Memory-Augmented Task Planning (coarse-to-fine planning), and Task Executor (dual-LLM execution engine).
The framework transforms task execution trajectories into reusable page chunks stored in a database, enabling the agent to retrieve relevant pages for informed and context-aware planning.
Its dual-LLM architecture, comprising a Decision-maker and a Judge, ensures effective tracking of task progress and handles complexities in mobile environments.

MMAT-1M: A Large Reasoning Dataset for Multimodal Agent Tuning

MMAT-1M (Multi-Modal Agent Tuning-One Million): introduces a novel four-stage data engine for multimodal agent tuning, including a Foundation Stage (curates multimodal data), Rationale Generation Stage (generates reasoning trajectories), Reflection Stage (refines rationales), and Integration Stage (formats dialogues).
This framework leverages GPT-4o and various API/RAG tools (Image Caption, OVD, OCR, Face Detection, RAG) to create a million-scale dataset supporting Chain-of-Thought, reflection, and dynamic tool usage.
The dataset provides both one-turn (ORR) and multi-turn (RR) formats, demonstrating significant performance gains for fine-tuning open-source multimodal models across diverse benchmarks.

GRAPH-R1: TOWARDS AGENTIC GRAPHRAG FRAME-WORK VIA END-TO-END REINFORCEMENT LEARNING

Graph-R1 (Agentic GraphRAG Framework): introduces an agentic GraphRAG framework via end-to-end reinforcement learning, featuring a Graph-R1 Agent (LLM-driven agent), Knowledge HyperGraph (GH) (structured knowledge environment), and Reinforcement Learning (RL) (end-to-end optimization).
The framework models retrieval as a multi-turn agent-environment interaction, optimizing the agent process through an outcome-directed Reward Function (R(τ)) that integrates generation quality, retrieval relevance, and structural reliability.
Graph-R1 leverages lightweight knowledge hypergraph construction and dual-path hypergraph retrieval to enhance reasoning accuracy, retrieval efficiency, and generation quality.

Prototyping Compliance: Participatory Legal UX for Platform Reporting Mechanisms under the DSA

Participatory Legal UX: introduces a qualitative case study examining how designers mediate between abstract legal requirements and real-world digital experiences for users, focusing on the design of content reporting mechanisms under Article 16 of the DSA, through an expert workshop utilizing participatory design methods, user personas, usability heuristics, and legal obligations to evaluate UI flows and generate compliance-fostering design solutions.
The study highlights critical usability barriers in existing reporting systems, such as poor discoverability, legalistic language, and lack of feedback, proposing participatory design as a bridge for disciplinary divides.
Findings emphasize the crucial role of designers in shaping policy and law by translating regulatory intentions into concrete digital experiences and resolving value tensions.

Out of Distribution, Out of Luck: How Well Can LLMs Trained on Vulnerability Datasets Detect Top 25 CWE Weaknesses?

Multi-Agent LLM Frameworks: introduces "Out of Distribution, Out of Luck: How Well Can LLMs Trained on Vulnerability Datasets Detect Top 25 CWE Weaknesses?", with Dataset Integration (Combines/standardizes datasets), LLM-Based Filtering (Filters unrelated code changes), RVG Context & Threat Modeler (Creates attack scenarios), RVG Vulnerable Implementer (Generates vulnerable code), RVG Security Auditor (Identifies/remediates vulnerabilities), RVG Security Reviewer (Validates vulnerability/remediation), Cross-Model Validation (Validates synthesized data), TITANVUL Vulnerability Auditor (Assesses vulnerability fixes), TITANVUL Vulnerability Critic (Reviews auditor's findings), TITANVUL Vulnerability Consensus (Synthesizes assessments/scores), and Manual Review (Verifies/validates data), where the paper addresses the generalization gap in automated vulnerability detection through new datasets and a data synthesis framework.
The paper introduces BENCHVUL, a manually curated benchmark, and TITANVUL, a large-scale high-quality training dataset, both designed to improve model generalization by mitigating data quality issues and imbalances.
Empirical results demonstrate that models trained on TITANVUL, especially when augmented with RVG-generated data, achieve significantly higher generalization accuracy compared to models trained on existing datasets.

StaffPro: an LLM Agent for Joint Staffing and Profiling

StaffPro (LLM Agent): introduces an LLM agent for joint staffing and profiling, integrating a Staffing Module (generates task schedules), a Profiling Module (estimates worker attributes), and a Long-term memory (stores historical data) to continuously improve personnel management.
The Staffing Module leverages an LLM for evaluating optimization criteria and aggregates scores before a Scheduler generates feasible schedules, which are then proposed to Workers for acceptance or refusal.
The Profiling Module, utilizing an LLM for analysis and reflection, processes feedback from Workers (self-evaluations, task acceptance/refusal) and Supervisors (performance reviews) to update the Worker profiling data in the Long-term memory, enhancing future staffing decisions.

Large Language Models for Wireless Communications: From Adaptation to Autonomy

Large Language Models for Wireless Communications: introduces a paradigm for transforming wireless systems by adapting pretrained LLMs for core communication tasks, developing wireless-specific foundation models, and enabling agentic LLMs with autonomous reasoning and coordination capabilities.
The paper details how LLMs can be adapted for physical layer prediction, resource allocation, and semantic communication, addressing modality mismatches and enhancing generalization.
It further explores the development of compact, domain-specific wireless foundation models for efficiency and multi-task generalization, and agentic LLMs for self-organizing, adaptive wireless networks through reasoning, memory, and tool use.

Evaluation and Benchmarking of LLM Agents: A Survey

Taxonomy for LLM-based Agent Evaluation: introduces a two-dimensional framework for evaluating LLM agents, encompassing Evaluation Objectives (what to evaluate) and Evaluation Process (how to evaluate), where Evaluation Objectives cover agent behavior, capabilities, reliability, and safety/alignment, and Evaluation Process includes interaction modes, data, metrics, tooling, and contexts.
This taxonomy aims to clarify the fragmented landscape of LLM agent evaluation, providing a systematic assessment framework for real-world deployment.
The paper also highlights enterprise-specific challenges like role-based access, reliability guarantees, and long-horizon interactions, and identifies future research directions for holistic, realistic, scalable, and efficient evaluation.

Transmission With Machine Language Tokens: A Paradigm for Task-Oriented Agent Communication

TMLT (Transmission With Machine Language Tokens): introduces a task-oriented agent communication system that leverages LLMs to learn specialized machine language tokens for efficient multi-modal information transmission, comprising an Agent Semantic Transmitter (multi-modal input processing), Joint Token and Channel Coding (token compression/robustness), Orthogonal Frequency Division Analog Transmission (analog signal transmission), Joint Token and Channel Decoding (token reconstruction), and an Agent Semantic Receiver (downstream task execution).
This system enables agents to communicate task-relevant information compactly and robustly by converting natural language and multi-modal inputs into machine-interpretable token embeddings transmitted over noisy wireless channels.
The approach employs end-to-end training with Low-Rank Adaptors (LoRA) to optimize for downstream tasks, significantly reducing transmission overhead and latency while maintaining accuracy.

Towards Locally Deployable Fine-Tuned Causal Large Language Models for Mode Choice Behaviour

LiTransMC (Locally Deployable Fine-Tuned Causal Large Language Model for Mode Choice Behaviour): introduces a fine-tuned causal LLM for travel mode choice prediction, with a Base LLM (foundational causal LLM), Data Ingestion Module (processes structured survey data), Prompt Engineering Module (constructs prompts with system instructions, data, few-shot examples), Inference Engine (manages LLM querying and response generation), Response Processing Module (parses LLM output into structured predictions and reasoning), Fine-tuning Module (adapts base LLM for mode choice prediction), and Evaluation Module (assesses predictive performance and reasoning quality), demonstrating the feasibility of creating specialist, locally deployable LLMs that integrate prediction and interpretability.
LiTransMC achieves state-of-the-art performance in weighted F1 score and Jensen-Shannon Divergence, surpassing untuned local models, larger proprietary systems, and classical mode choice methods.
The framework combines structured behavioral prediction with natural language reasoning, enabling conversational, multi-task transport models for agent-based simulations, policy testing, and behavioral insight generation.

MemTool: Optimizing Short-Term Memory Management for Dynamic Tool Calling in LLM Agent Multi-Turn Conversations

MemTool: introduces a short-term memory framework for LLM agents, with Autonomous Agent Mode (full tool management autonomy), Workflow Mode (deterministic control without autonomy), and Hybrid Mode (combining autonomous and deterministic control).
This framework enables LLM agents to dynamically manage tools or Model Context Protocol (MCP) server contexts across multi-turn conversations, addressing the limitations of fixed context windows in repeated tool usage scenarios.
Evaluated across 13+ LLMs, MemTool demonstrates varying tool removal efficiencies and task completion rates across its modes, providing insights into effective short-term tool memory management.

Graph-Augmented Large Language Model Agents: Current Progress and Future Prospects

GLA (Graph-Augmented Large Language Model Agents): introduces a comprehensive overview of recent advances and future prospects in integrating graphs with LLM agents, enhancing their planning, memory, tool usage, and multi-agent system capabilities.
The paper categorizes existing GLA methods by their primary functions, analyzing how various graph types and learning algorithms contribute to each module.
It highlights key future directions for GLA, including dynamic graph learning, unified graph abstractions, multimodal graphs, and large-scale multi-agent system simulation.

28th July 2025

MAAD: Automate Software Architecture Design through Knowledge-Driven Multi-Agent Collaboration

MAAD (Multi-Agent Architecture Design): introduces an automated framework for software architecture design, orchestrating four specialized LLM agents—Analyst (requirements analysis), Modeler (architecture blueprint generation), Designer (detailed documentation), and Evaluator (architecture assessment)—to collaboratively produce architectural blueprints and evaluation reports.
The framework integrates a knowledge source via Retrieval-Augmented Generation (RAG) to infuse external knowledge into the Modeler and Designer agents, enhancing design quality and mitigating hallucinations.
MAAD demonstrates superior performance in generating comprehensive and fine-grained architectural solutions compared to baseline multi-agent systems, emphasizing the critical impact of LLM selection on design quality.

ProMemAssist: Exploring Timely Proactive Assistance Through Working Memory Modeling in Multi-Modal Wearable Devices

ProMemAssist: introduces a smart glasses system that models a user's working memory in real-time using multi-modal sensor signals, including a Working Memory (WM) Model, Assistance Generator (LLM), and Timing Predictor Module.
This system encodes visuospatial and phonological memory items into an episodic buffer, informing a timing predictor that balances assistance value with interruption cost.
By leveraging WM modeling, the system delivers more selective and context-sensitive proactive assistance, leading to higher user engagement and reduced frustration compared to an LLM baseline.

Games Agents Play: Towards Transactional Analysis in LLM-based Multi-Agent Systems

Trans-ACT (Transactional Analysis Cognitive Toolkit): introduces a novel framework that embeds Transactional Analysis principles into Multi-Agent Systems to create agents with realistic psychological dynamics, featuring an Agent orchestrator, Parent, Adult, and Child ego state agents, a Memory Tool with distinct memory types for each ego state, a Life Script, LLMs for reasoning, and prompts for input.
The framework structures agent behavior around distinct ego states, each modeled as a ReAct agent within a LangGraph framework, dynamically activating internal schemas via similarity-based memory retrieval to guide responses consistent with human cognition.
Trans-ACT aims to enhance the psychological depth of AI agents, supporting applications in conflict resolution, educational support, and social psychology studies by simulating complex behavioral dynamics.

Agentic Web: Weaving the Next Web with AI Agents

Agentic Web: introduces a structured framework for understanding and building a new internet paradigm where autonomous AI agents, powered by LLMs, act as intermediaries to plan, coordinate, and execute goal-directed tasks on behalf of users.
This framework integrates core architectural components like User Clients, Intelligent Agents, and Backend Services, supported by communication protocols (MCP, A2A) and a billing ledger (CABL), to enable machine-to-machine interactions.
The Agentic Web redefines information flow and value creation through its three conceptual dimensions—Intelligence, Interaction, and Economy—shifting from human-driven consumption to autonomous, goal-driven task execution.

Multi-Agent-as-Judge: Aligning LLM-Agent-Based Automated Evaluation with Multi-Dimensional Human Evaluation

MAJ-EVAL (Multi-Agent-as-Judge Evaluation Framework): introduces an LLM-based multi-agent evaluation framework that automatically constructs evaluator personas and orchestrates in-group debates to generate multi-dimensional feedback.
The framework's Stakeholder Persona Creation Module leverages the Evaluative Dimension Extraction LLM (Me) to identify stakeholder perspectives and the Dimension-Based Persona Construction LLM (MƉ) to construct detailed agent personas.
These LLM Agents engage in a Multi-Agent-as-Judge Debate Evaluation, where an In-Group Moderator coordinates discussions, agents refine their evaluations via Memory Update, and an Aggregator Agent synthesizes final scores.

MIRAGE-Bench: LLM Agent is Hallucinating and Where to Find Them

MIRAGE-Bench: introduces a unified benchmark for eliciting and evaluating hallucinations in interactive LLM-agent scenarios, with a Categorization Module (Classifies hallucinations) using a Taxonomy (Defines three types of unfaithfulness) including Task Instruction Unfaithfulness (Violates task goals/constraints), Interaction History Unfaithfulness (Contradicts past actions/outcomes), and Environment Observation Unfaithfulness (Misrepresents environment state), an Elicitation Module (Generates hallucination-prone scenarios) employing a Contextual Snapshot Strategy (Freezes agent state for reproducibility), and an Evaluation Module (Assesses hallucination behaviors) utilizing an LLM-as-a-Judge Paradigm (Uses an LLM to score agent faithfulness) with a Judge LLM (Performs semantic reasoning for evaluation).
The benchmark systematically audits existing agent benchmarks to identify hallucination-prone risk settings and synthesizes test cases using a snapshot strategy to isolate decision points for deterministic and reproducible analysis.
The framework adopts a fine-grained LLM-as-a-Judge paradigm with tailored risk-aware prompts to enable scalable, high-fidelity assessment of agent actions without enumerating full action spaces.

Core Safety Values for Provably Corrigible Agents

Corrigible Utility Set Framework: introduces an implementable framework for AI corrigibility, with provable guarantees in multi-step, partially observed environments, by replacing a single opaque reward with five structurally separate utility heads—deference, switch-access preservation, truthfulness, low-impact behavior, and bounded task reward—combined lexicographically by strict weight gaps.
The framework operates within a Partially Observable Off-Switch Game (PO-OSG) environment, modeling agent-human interactions, self-spawning agents, and gradual loss of control, ensuring safety properties are bounded while maintaining net human benefit.
The paper demonstrates that verifying safety of arbitrary post-hack agents is undecidable in open-ended environments but carves out a finite-horizon "decidable island" where safety can be certified with privacy-preserving, constant-round zero-knowledge proofs.

Aligning Large Language Model Agents with Rational and Moral Preferences: A Supervised Fine-Tuning Approach

SFTA (Supervised Fine-Tuning Approach): introduces a pipeline to align LLM agents with rational and moral preferences, using synthetic datasets derived from economic reasoning.
The approach fine-tunes a GPT-4o LLM on structured chat interactions, embedding homo economicus (self-interest) and homo moralis (Kantian universalizability) utility functions.
Evaluations in economic games, moral dilemmas, and algorithmic pricing demonstrate improved behavioral consistency and interpretability compared to baseline LLMs.

Security Challenges in AI Agent Deployment: Insights from a Large Scale Public Competition

AI Agent Red Teaming Challenge: introduces a large-scale public competition for evaluating the security of LLM-powered AI agents, featuring adversarial attacks by red-teamers against agents operating in realistic environments with various tools, memory, and web access, all governed by specific policy types.
The challenge, hosted on the Gray Swan Arena, involved 1.8 million prompt injection attacks across 44 scenarios and 22 frontier LLMs, revealing widespread policy violations and high attack transferability.
The competition's results led to the creation of the ART benchmark, a dataset of high-impact attacks designed to support more rigorous security assessment and drive progress toward safer agent deployment.

AQUA: A Large Language Model for Aquaculture & Fisheries

AQUADAPT (Aquaculture Data Acquisition, Processing and Tuning): introduces a structured, agentic framework for generating and refining high-quality, domain-relevant datasets to train AQUA, a large language model for aquaculture.
The framework integrates an Expert Agent for human-in-the-loop data curation, a Data Agent for corpus acquisition and preprocessing, a QA Agent for dual-path question-answer generation, and a Scoring Agent for automated quality assessment and dataset filtering.
This methodology ensures domain accuracy and contextual fluency, enabling AQUA to provide intelligent insights and enhance operational efficiency in aquaculture.

Integrating LLM in Agent-Based Social Simulation: Opportunities and Challenges

LLM-augmented Agent-Based Social Simulation: introduces a hybrid approach for social simulation that integrates Large Language Models (LLMs) as core agent intelligence, featuring an LLM Instance/Session (core agent intelligence), a Memory System (stores past experiences), a Reflection and Summarization Layer (processes observations into mental models), a Planning Component (generates actions based on reflections), an Orchestration Layer (manages simulation time and agent interactions), External Tools/APIs (augment agent capabilities), and often integrated with Traditional ABM Platforms (provide structured environment and analysis).
This framework leverages LLMs' capacity for human-like language generation and social reasoning to create more flexible and expressive agents, enabling rapid simulation of large-scale social dynamics and exploration of complex scenarios.
Despite opportunities for enhanced realism and scalability, the approach faces challenges including LLM biases, hallucination, inconsistency, high computational costs, and the "black-box" nature of LLMs, necessitating robust validation and careful scenario scoping.

27th July 2025

MazeEval: A Benchmark for Testing Sequential Decision-Making in Language Models

MazeEval: introduces a benchmark for evaluating LLMs' spatial reasoning, with maze generation (creates mazes), LLM interaction interface (enables model interaction), and evaluation metrics (assesses performance) components, designed to isolate and evaluate pure spatial reasoning in LLMs through coordinate-based maze navigation tasks.
The benchmark challenges LLMs to navigate mazes using only coordinate-based feedback and distance-to-wall information, without visual input, to test fundamental spatial cognition.
MazeEval also includes a multilingual evaluation in English and Icelandic to assess cross-linguistic transfer of spatial abilities and the influence of linguistic resources on spatial reasoning.

Advancing Shared and Multi-Agent Autonomy in Underwater Missions: Integrating Knowledge Graphs and Retrieval-Augmented Generation

RAG (Retrieval-Augmented Generation) System: introduces a framework for advancing shared and multi-agent autonomy in underwater missions, with Information Retrieval System (retrieves data), Mission Behaviors Generator (produces actions), Large Language Model (reasoning and decision-making), BT Manager (manages behavior trees), Context Manager (monitors variables), Task Execution (executes actions), Human-in-the-Loop (human interaction point), VLC Human Computer (human-robot communication), VLC Robot (robot communication module), Remote Sensor (data source), Data Processing (sensor data handling), Autonomous Underwater Vehicle (robotic agent), and Docking Station (recharging/data transfer point), enabling autonomous decision-making and seamless human-robot interaction for complex underwater tasks.
The framework integrates an LLM with a Knowledge Graph and a structured Taxonomy, allowing AUVs to autonomously plan and execute missions while dynamically incorporating real-time updates and human oversight.
The system leverages Behavior Trees for structured decision-making, ensuring efficient and flexible mission execution adaptable to environmental uncertainties and supporting multi-robot coordination.

Embeddings to Diagnosis: Latent Fragility under Agentic Perturbations in Clinical LLMs

LAPD (Latent Agentic Perturbation Diagnostics): introduces a geometry-aware evaluation framework that systematically probes the latent robustness of clinical LLMs using structured adversarial edits, with all LAPD-components, where synthetic or real clinical notes are processed through structured perturbation and latent embedding projection, and the resulting representations are analyzed for fragility using geometry-aware metrics and surface-level clinical agreement.
The framework introduces Latent Diagnosis Flip Rate (LDFR), a model-agnostic diagnostic signal that captures representational instability when embeddings cross decision boundaries in PCA-reduced latent space.
The paper validates LDFR on real clinical notes, confirming its generalizability beyond synthetic settings and revealing a persistent gap between surface robustness and semantic stability in safety-critical clinical AI.

ELMES: An Automated Framework for Evaluating Large Language Models in Educational Scenarios

ELMES (Evaluation of Large Models in Educational Scenarios): introduces an open-source automated evaluation framework for LLMs in educational settings, with Task Loading, Agent DAG Construction, Dialogue Generation, Result Evaluation, and Data Aggregation & Visualization components, enabling flexible scenario design and objective pedagogical metric quantification.
The framework utilizes a modular architecture, declarative configuration files, and a hybrid evaluation engine (LLM-as-a-Judge) to automate the entire workflow from dialogue generation to multi-dimensional quantitative analysis.
It systematically benchmarks LLMs across four critical educational scenarios—Knowledge Point Explanation, Guided Problem-Solving Teaching, Interdisciplinary Lesson Plan Generation, and Contextualized Question Generation—using fine-grained, expert-developed metrics.

SciToolAgent: A Knowledge Graph-Driven Scientific Agent for Multi-Tool Integration

SciToolAgent (Knowledge Graph-Driven Scientific Agent for Multi-Tool Integration): introduces an LLM-powered agent that automates scientific tools by leveraging a SciToolKG (encodes tool relationships, dependencies) and includes Planner (devises strategy), Executor (implements tools), and Summarizer (synthesizes results) components.
The framework integrates a comprehensive Toolset (collection of scientific tools) and a Safety check module (ensures ethical tool usage) supported by a Safeguard database (contains hazardous substances data) for responsible tool automation.
SciToolAgent utilizes a Chain-of-Tools (planned sequence of tools) and a Memory module (stores context for queries) to enable intelligent tool selection, execution, and iterative Re-planning (iterative plan refinement) for complex scientific workflows.

MLC-Agent: Cognitive Model based on Memory-Learning Collaboration in LLM Empowered Agent Simulation Environment

MLC-Agent (Cognitive Model based on Memory-Learning Collaboration): introduces an individual agent model for LLM-empowered agent simulation environments, integrating memory and learning mechanisms for enhanced decision-making, with components including Individual Perception, Decision-Making Mechanism (Learning Model, Memory Model), Behavior Set, Interaction Module, and External Knowledge Base.
The framework employs a hierarchical memory structure, comprising an Individual Memory Set, Collective Memory Set, and Memory Buffer Pool, alongside a multi-indicator evaluation mechanism for dynamic memory updates and collaborative decision-making.
This integration promotes knowledge sharing and dissemination among agents, enabling them to continuously optimize decision-making by combining contextual knowledge in dynamic environments, leading to improved adaptability and anthropomorphic characteristics.

Goal Alignment in LLM-Based User Simulators for Conversational AI

UGST (User Goal State Tracking): introduces a novel framework and a three-stage methodology for developing goal-aligned LLM-based user simulators, which includes Inference-time Steering (conditions simulator with goal state), Cold-Start Supervised Fine-Tuning (SFT) (trains simulator for autonomy), and Group Relative Policy Optimization (GRPO) (refines simulator with rewards), aiming to address goal misalignment in multi-turn conversations.
The framework dynamically tracks a User Goal State (structured goal representation) across conversations, decomposing user goals into sub-components like User Profile, User Policy, Task Objectives, Requirements, and Preferences, each with a dynamic status.
This approach significantly improves LLM-based user simulator (LLM-based agent) goal alignment and response generation by leveraging explicit Reasoning Traces (explicit goal progression steps) and UGST Reward Signals (structured feedback for RL) derived from the tracked goal state.

AI-Driven Generation of Old English: A Framework for Low-Resource Languages

AI-Driven Old English Generation Framework: introduces a scalable framework for generating high-quality Old English texts, employing a multi-stage methodology that includes data preparation, model training (Domain-Adaptive Pretraining and Task-Adaptive Pretraining), and synthetic data generation via a dual-agent pipeline.
The framework leverages parameter-efficient fine-tuning (LoRA) and data augmentation through backtranslation to adapt LLMs for low-resource Old English, significantly expanding its digital corpus.
Evaluation with automated metrics (BLEU, METEOR, CHRF) and expert human assessment confirms substantial improvements in translation quality and linguistic fidelity, offering a blueprint for revitalizing other endangered languages.

26th July 2025

Text2Vis: A Challenging and Diverse Benchmark for Generating Multimodal Visualizations from Text

cross-modal actor-critic agentic inference framework: introduces a system that jointly refines textual answers and visualization code, with an Actor (generates initial and refined responses) producing outputs and a Critic (evaluates and provides feedback) assessing them using multimodal feedback, including Answer Feedback (numerical correctness), Code Feedback (syntax/semantic checks), and Visual Feedback (chart quality), all within a Refinement Loop (iterative improvement process).
This framework enhances answer accuracy and chart quality by incorporating multimodal feedback, outperforming direct inference methods.
The framework is model-agnostic, routing initial outputs from any baseline inference model into its iterative refinement loop for improved alignment with query intent.

AgentMesh: A Cooperative Multi-Agent Generative AI Framework for Software Development Automation

AgentMesh (A Cooperative Multi-Agent Generative AI Framework for Software Development Automation): introduces a Python-based framework that automates software development by orchestrating specialized LLM-powered agents, including a Planner Agent (decomposes requests, plans tasks), Coder Agent (generates code, implements subtasks), Debugger Agent (tests code, fixes errors), and Reviewer Agent (validates output, quality assurance), all managed by an AgentMesh Orchestrator (manages workflow, coordinates agents) and interacting through a shared Project State (shared codebase, specifications, errors), powered by an LLM Backend (powers agents' intelligence), and utilizing a Sandbox Environment (executes code safely) and Conversation Log (logs agent LLM interactions).
The framework mimics human software teams, enabling agents to communicate via shared artifacts and iteratively refine code through a feedback loop, enhancing reliability and addressing complex tasks more robustly than single-agent approaches.
Implemented in Python using OpenAI's GPT-4, the system demonstrates the potential of structured LLM orchestration in software engineering, offering a modular design for extensibility and future integration of advanced tools or learning capabilities.

Think, Act, Learn: A Framework for Autonomous Robotic Agents using Closed-Loop Large Language Models

GUI-Learner: introduces a novel architecture for autonomous robotic agents, integrating a Perception Module (interprets raw visual information), a Decision Module (selects next action), and a Hybrid Learning Strategy (combines two learning phases) with Behavioral Cloning (initial policy from expert demos) and Offline Reinforcement Learning (refines policy from self-exploration).
This framework enables embodied agents to autonomously learn and refine policies through continuous interaction, establishing a closed-loop cycle where an LLM "thinks" by decomposing commands, "acts" by executing plans and gathering feedback, and "learns" by processing feedback for self-reflection and corrective strategies.
The approach significantly outperforms baseline methods on complex, long-horizon tasks in both simulation and real-world GUI environments, achieving high success rates and generalization to unseen tasks.

AGENTIC REINFORCED POLICY OPTIMIZATION

ARPO (Agentic Reinforced Policy Optimization): introduces a novel agentic RL algorithm tailored for training multi-turn LLM-based agents, with a Rollout Module (generates trajectories), Entropy-based Adaptive Rollout (manages sampling), Policy Model (LLM) (generates responses), Tool Environment (provides external tools), Advantage Attribution Estimation (assigns advantage values), Reference Model (LLM) (provides baseline for KL divergence), Reward Model (provides reward signals), and Group Computation (processes advantages), designed to encourage adaptive branching sampling during high-entropy tool-call rounds and internalize advantage differences in stepwise tool-use behaviors.
The framework incorporates an entropy-based adaptive rollout mechanism that dynamically balances global and partial sampling, promoting exploration at steps with high uncertainty after tool usage.
It integrates an advantage attribution estimation to enable LLMs to internalize advantage differences in stepwise tool-use interactions, achieving improved performance with reduced tool-use budget.

LARGE LANGUAGE MODEL AGENT FOR STRUCTURAL DRAWING GENERATION USING REACT PROMPT ENGINEERING AND RETRIEVAL AUGMENTED GENERATION

LLM Agent (Large Language Model Agent): introduces a novel generative AI-based method for structural drawing generation, employing a chain of LLMs (LLM1, LLM2, LLM3, LLM4, LLM5, LLM6) to process natural language descriptions into AutoCAD Python code, supported by a Knowledge Database, ReAct Prompt Engineering, and Retrieval Augmented Generation.
This multi-LLM pipeline addresses limitations of single LLMs by breaking down complex tasks into subtasks, enhancing efficiency, reliability, and accuracy in converting varied natural language inputs into precise structural drawings.
The approach significantly reduces the labor-intensive and time-consuming process of manual structural drawing production, facilitating iterative design and ensuring compliance with regulatory standards.

Assemble Your Crew: Automatic Multi-agent Communication Topology Design via Autoregressive Graph Generation

ARG-DESIGNER (AutoRegressive Graph generation model that acts as a MAS topology Designer): introduces a novel autoregressive model for multi-agent system (MAS) communication topology design, which constructs collaboration graphs from scratch by dynamically determining agent roles and communication links through its Node Generation and Edge Generation components.
This framework reframes MAS design as a conditional autoregressive graph generation task, enabling flexible and extensible topology creation precisely tailored to specific task requirements.
The model achieves state-of-the-art performance and superior token efficiency by progressively selecting appropriate agents from an extensible pool and establishing optimal communication links.

25th July 2025

AGENTIC AI FOR AUTONOMOUS ANOMALY MANAGEMENT IN COMPLEX SYSTEMS

Agentic AI (AI agent augmented with large language models, diverse tools, and knowledge-based systems): introduces an autonomous anomaly management framework for complex systems, integrating an AI Agent (core autonomous entity), LLMs (cognitive core for reasoning), Tools (diverse specialized utilities), Knowledge-based Systems (stores domain-specific information), Memory Systems (retains context and knowledge), and an LLM-as-a-judge module (evaluates tool use), to continuously analyze, learn, and respond to abnormal behaviors.
This framework aims to overcome limitations of human-dependent anomaly management by enabling autonomous decision-making, contextual understanding, and real-time adaptation to evolving conditions.
The system synthesizes insights across disciplines, detects subtle patterns, and adapts strategies using both implicit and explicit knowledge, enhancing system resilience and adaptability.

Simulating multiple human perspectives in socio-ecological systems using large language models

HoPeS (Human-Oriented Perspective Shifting): introduces a framework for simulating human perspectives in socio-ecological systems, integrating LLM-powered agents, a simulation protocol, and a prototype system with PTS and RLC components.
The framework enables users to assume stakeholder roles, interact with LLM agents, and reflect on diverse perspectives to deepen understanding of complex socio-ecological dynamics.
It facilitates exploration of institutional dynamics and land use change through narrative-driven and numerical experiments, fostering interdisciplinary collaboration.

"X of Information" Continuum: A Survey on AI-Driven Multi-dimensional Metrics for Next-Generation Networked Systems

HF-AIMDIM (Hierarchical Framework for AI-driven Multi-dimensional Information Metrics): introduces a systematic framework for next-generation networked systems, integrating Fundamental Metric Dimensions, AI Enhancement Technologies, and Application Scenarios to optimize information quality.
The framework structures information metrics along temporal, quality/utility, reliability/robustness, and network/communication dimensions, leveraging AI for adaptive, context-aware optimization.
It illustrates the revolutionary promise of multi-dimensional information metrics for diverse operational needs across critical application domains.

Efficient and Scalable Agentic AI with Heterogeneous Systems

Orchestration and Serving System: introduces a comprehensive system architecture for efficient and scalable execution of dynamic AI agent workloads on heterogeneous compute infrastructure, integrating an API Server, Inference Serving System with Planner & Scheduler, Load Balancer / Request Router, Serving Nodes (each with Runtime, Model Execution, Memory Management, Subgraph Execution, KV Cache, Metrics Collector), High Performance Interconnect, Cache Management, and Object Storage.
This system dynamically plans and places fine-grained computational components onto a distributed fleet of heterogeneous hardware, continuously monitoring node availability, workload characteristics, and resource utilization to optimize throughput and cost efficiency while meeting end-to-end SLAs.
Leveraging an MLIR-based representation for agent workloads, the system enables cost-aware optimization, heterogeneous hardware integration, and dynamic orchestration, allowing optimal mapping of diverse agent tasks to the most cost-effective hardware resources.

MCP4EDA: LLM-Powered Model Context Protocol RTL-to-GDSII Automation with Backend Aware Synthesis Optimization

MCP4EDA (Model Context Protocol for Electronic Design Automation): introduces an LLM-powered Model Context Protocol server that automates the RTL-to-GDSII design flow through natural language interaction, integrating a MCP Host, MCP Clients, MCP Server, and an LLM interacting with various EDA tools across Simulation, Synthesis, and Backend Domains.
The system implements a backend-aware synthesis optimization methodology, where the LLM analyzes post-layout metrics from OpenLane results to iteratively refine synthesis TCL scripts, establishing a closed-loop optimization system.
This approach leverages real backend performance data to guide synthesis parameter tuning and optimization sequence selection, enabling dynamic tool selection and adaptive execution strategies for improved timing closure and area reduction.

CodeEvo: Interaction-Driven Synthesis of Code-centric Data through Hybrid and Iterative Feedback

CodeEvo: introduces an interaction-driven synthesis framework for high-quality code-centric data, featuring a Coder (generates code and tests) and a Reviewer (guides synthesis process) LLM agents, a Source (provides seed instructions), an Environment (provides compiler feedback), a Hybrid Feedback Mechanism (combines compiler and LLM feedback), Keyword-Guided Instruction Generation (anchors instruction evolution), Synthesized Trajectories (collects instruction-code pairs), and Base Models (models trained on data).
The framework leverages iterative interactions between the Coder and Reviewer, enhanced by a hybrid feedback mechanism that integrates deterministic compiler verification with flexible LLM-based evaluations to ensure functional correctness.
CodeEvo also employs keyword-guided instruction generation to maintain semantic control and progressively increase the difficulty and diversity of synthesized instruction-code pairs.

Mut4All: Fuzzing Compilers via LLM-Synthesized Mutators Learned from Bug Reports

Mut4All: introduces a fully automated, language-agnostic framework for synthesizing mutation operators by leveraging LLMs and compiler-specific knowledge from bug reports, with Mutator Invention Agent (identifies mutation targets/generates specifications), Mutator Implementation Synthesis Agent (synthesizes mutator code/fine-tuned), and Mutator Refinement Agent (validates/corrects mutators) components, where it automates the entire mutator lifecycle from discovery and design to implementation for mutation-based compiler fuzzing.
The framework analyzes real-world Bug Reports to identify error-prone language features, formulates Mutator Specifications, synthesizes Raw Mutators, and refines them into Valid Mutators using Test Suites and Feedback.
Mut4All integrates these Valid Mutators into a customized Fuzzer, which utilizes Seed Programs, Crash/Hang Oracles, and Differential Testing to uncover bugs in target Compilers like Rust and C++.

Event-Driven Storytelling with Multiple Lifelike Humans in a 3D Scene

Event-Driven Storytelling Framework: introduces a modular LLM-based framework for generating dynamic 3D scenes with multiple lifelike human characters, decomposing complex multi-agent behavior planning into manageable event sequences.
This framework leverages a High-level Action Planning Module, comprising a Scene Describer, Narrator, and Event Parser, to reason contextually and generate detailed event information for character actions.
It further employs a Low-level Motion Synthesis Module to convert these events into realistic 3D character motions, ensuring collision-free trajectories and diverse interactions within the scene.

iPLAN: Redefining Indoor Wireless Network Planning Through Large Language Models

iPLAN (indoor wireless network Planning with large LANguage models): introduces a framework for optimizing indoor wireless network planning, leveraging its comprehensive set of components including LLM optimizers, multi-modal IE representations, domain knowledge bases, and a multi-agent system for iterative design and evaluation.
This framework addresses challenges in traditional IWN planning by integrating domain-specific knowledge, multi-modal data alignment, and iterative refinement for superior performance and generalization.
iPLAN supports both IWN planning based on pre-existing Indoor Environments and joint design of IWN with new wireless-friendly buildings, demonstrating significant improvements in coverage and efficiency.

Debating Truth: Debate-driven Claim Verification with Multiple Large Language Model Agents

DebateCV (Debate-driven Claim Verification): introduces a multi-agent LLM framework for claim verification that simulates human fact-checking debates, leveraging multiple LLM agents and a novel post-training strategy.
The framework employs two Debater LLMs to argue opposing stances on a claim using provided evidences, while a Moderator LLM evaluates arguments and issues a verdict with justifications.
To address data scarcity and improve Moderator performance, the system synthesizes debate data and applies a post-training strategy involving a Corrector LLM for error correction, followed by supervised fine-tuning and direct preference optimization.

Large Language Model-Based Task Offloading and Resource Allocation for Digital Twin Edge Computing Networks

LLM-based context learning: introduces a method for task offloading and resource allocation in digital twin edge computing networks, utilizing MARL to generate an initial case set and an LLM to optimize decisions based on this set and real-time data.
This approach aims to enhance system QoS and energy efficiency by transforming long-term constraints into short-term decisions via Lyapunov optimization.
The framework demonstrates comparable or superior performance to traditional MARL, leveraging LLMs for efficient decision-making in dynamic vehicular environments.

Large Language Model Powered Automated Modeling and Optimization of Active Distribution Network Dispatch Problems

Multi-LLM Coordination Architecture: introduces an LLM-powered automated modeling and optimization approach for Active Distribution Network (ADN) dispatch problems, which decomposes the process into sequential stages handled by specialized LLM agents (Information Extractor, Problem Formulator, Code Programmer) and a Solver, all accessible via an LLM Powered Interface.
This framework addresses the lack of specialized expertise among ADN operators by enabling intelligent, flexible ADN dispatch through natural language queries, reducing reliance on human experts.
Tailored refinement techniques, including prompt methods, multi-round dialogues, and external knowledge enhancement, are developed for each LLM agent to improve accuracy and reliability of generated content.

Agent0: Leveraging LLM Agents to Discover Multi-value Features from Text for Enhanced Recommendations

Agent0 (Architect-Sentinel-Oracle model): introduces an LLM-driven, agent-based system for automated information extraction and feature construction, integrating an Architect (LLM for prompt generation/refinement), an Oracle (AutoML for feature relevance evaluation), Sentinels (LLMs for text-to-feature conversion), and Shared Memory (stores prompt-score tuples/data).
This system automates the discovery of high-signal multi-value features from raw, unstructured text for enhanced recommender systems by iteratively refining prompts based on dynamic feedback loops.
The framework mimics a data scientist's iterative process, enabling accelerated feature engineering and research in recommender system development.

SliceMate: Accurate and Scalable Static Program Slicing via LLM-Powered Agents

SLICEMATE (Accurate and Scalable Static Program Slicing via LLM-Powered Agents): introduces a novel static program slicing solution that integrates three specialized LLM agents—synthesis, verification, and refinement—orchestrated by a control module to produce program slices without explicit dependency graph construction or large-scale training data.
The framework reframes slicing as an LLM-driven code generation process, enabling it to scale to large, multi-file programs and robustly handle incomplete or non-compilable code by leveraging LLMs' broad programming knowledge.
SLICEMATE significantly outperforms traditional and learning-based slicing tools in accuracy and F1 score, boosting Top-10 localization accuracy in downstream debugging and bug localization tasks.

A Systematic Review of Key Retrieval-Augmented Generation (RAG) Systems: Progress, Gaps, and Future Directions

RAG (Retrieval-Augmented Generation): introduces a framework that combines a neural text retrieval module and a text generation module, processing user queries through chunking, embedding, retrieval, re-ranking, and generation to enhance factual grounding and contextual relevance.
This systematic review traces RAG's evolution from early open-domain question answering to state-of-the-art implementations, analyzing its technical components, year-by-year progress, and enterprise deployment.
The review also evaluates RAG system performance, identifies persistent challenges like retrieval quality and privacy, and highlights emerging solutions such as hybrid retrieval and agentic architectures for future knowledge-intensive NLP systems.

MindFlow+: A Self-Evolving Agent for E-Commerce Customer Service

MindFlow+ (Self-Evolving Agent for E-Commerce Customer Service): introduces a self-evolving agent framework for e-commerce customer service, combining LLMs with imitation learning and offline reinforcement learning, using tool-augmented demonstration construction and reward-conditioned data modeling to generate contextually relevant and task-accurate responses.
The framework unifies tool-augmented reasoning and preference-aligned response generation into a single training process, enabling adaptive behavior without modifying the underlying LLM architecture.
It leverages a Unified Annotated Dataset, enriched with Factual Knowledge, Tool-Use Capabilities (including specific tools), and User Preference signals, to fine-tune a Pre-trained LLM for domain-specific, multi-turn dialogue.

24th July 2025

MemoCoder: Automated Function Synthesis using LLM-Supported Agents

MemoCoder: introduces a multi-agent framework for automated function synthesis, featuring a Planner (generates strategies), Code Writer (generates and refines code), Test Executor (executes code and identifies errors), Mentor (supervises repair and distills knowledge), and a Fixing Knowledge Base (stores successful repairs).
The framework enables collaborative problem-solving and persistent learning from past fixes by leveraging LLM-based agents and a memory module.
It consistently outperforms zero-shot prompting and self-repair strategies, demonstrating effectiveness in iterative refinement and knowledge-guided code generation.

Agentic Program Repair from Test Failures at Scale: A Neuro-symbolic approach with static analysis and test execution feedback

Engineering Agent: introduces an automated program repair framework that fixes source code based on test failures at scale, integrating a Test Failure Manager Bot, an Engineering Agent with a Setup development environment, an Agentic Harness (comprising a ReAct Loop for Reason and Act components using Patching, Tests, and Static Analyses tools), a Verification stage, an LLM-as-a-Judge, and human Code Review, with a Discard action for rejected patches.
The framework leverages neuro-symbolic AI by providing feedback from static analysis tools and test execution traces to the agent, allowing it to refine its solutions iteratively within the ReAct loop.
This system aims to generate high-quality code patches that pass validation and align with human engineering standards, ultimately reducing manual intervention in program repair workflows.

Explainable Mapper: Charting LLM Embedding Spaces Using Perturbation-Based Explanation and Verification Agents

Explainable Mapper: introduces a framework for semi-automatic annotation of LLM embedding properties, featuring a visual analytics workspace and two LLM-based mapper agents (Explanation Agent and Verification Agent) that employ summarization, comparison, and perturbation operations to generate and verify explanations of mapper elements.
The framework leverages mapper graphs to summarize the topological structure of LLM embedding spaces, where nodes represent topological neighborhoods and edges connect overlapping neighborhoods.
It addresses the challenge of manually exploring vast embedding spaces by providing customizable LLM-based agents to explore and explain linguistic characteristics and verify explanation robustness.

HARLF: Hierarchical Reinforcement Learning and Lightweight LLM-Driven Sentiment Integration for Financial Portfolio Optimization

HARLF (Hierarchical Reinforcement Learning and Lightweight LLM-Driven Sentiment Integration for Financial Portfolio Optimization): introduces a three-tier hierarchical framework for portfolio optimization, with Observation (input data stream), FinBERT (extracts financial sentiment), Base RL Agents (process hybrid data), Meta-Agents (aggregate base decisions), Data-driven Meta-Agent (refines data-based outputs), NLP-based Meta-Agent (refines NLP-based outputs), Super-Agent (synthesizes final allocations), Action (portfolio weight output), Stable Baselines 3 (RL algorithms library), and PyTorch (deep learning framework), designed to combine sentiment signals from financial news with traditional market indicators for robust decision-making.
The framework leverages lightweight, domain-specific LLMs like FinBERT for sentiment analysis and Deep Reinforcement Learning (DRL) for sequential decision-making, addressing limitations of single-modal or flat architectures in financial markets.
Its hierarchical structure, comprising base RL agents, meta-agents, and a super-agent, enhances stability, scalability, and interpretability for adaptive portfolio allocations across diverse market regimes.

FinMarBa: A Market-Informed Dataset for Financial Sentiment Classification

FinMarBa (Market-Informed Dataset for Financial Sentiment Classification): introduces a novel market-driven annotation framework for financial sentiment classification, utilizing components like Collect News Headlines, Headline Generation, Ticker Identification, Historical Data Retrieval, Percentage Change Calculation, Quantile Determination, Classification, and Machine Label to produce the FinMarBa Dataset.
This framework leverages LLMs (specifically GPT-4) for automated headline extraction and ticker identification, then applies a quantile-based classification method using historical market data to assign sentiment labels.
The approach aims to eliminate human biases and more accurately reflect market reactions to financial news, providing a large-scale, objectively labeled dataset for fine-tuning and evaluating LLMs in financial NLP.

Reasoning Beyond the Obvious: Evaluating Divergent and Convergent Thinking in LLMs for Financial Scenarios

ConDiFi (Convergent-Divergent for Financial Reasoning Benchmark): introduces a novel benchmark designed to assess both divergent and convergent reasoning in LLMs for financial scenarios, featuring distinct dataset generation pipelines and a multi-dimensional evaluation framework.
The benchmark includes 607 macro-financial prompts for divergent reasoning, evaluated by a GPT-4o Judging Model across five dimensions, and 990 multi-hop adversarial MCQs for convergent reasoning.
Its dataset construction mitigates data contamination by using post-LLM training cutoff data and employs adversarial pipelines to generate challenging questions, providing a holistic standard for measuring LLM cognitive capabilities in finance.

ProactiveVA: Proactive Visual Analytics with LLM-Based UI Agent

ProactiveVA (Proactive Visual Analytics): introduces an LLM-based UI agent that monitors user interactions and proactively delivers context-aware assistance, integrating Perception, Reasoning, and Acting modules, an LLM, and Storage (Memory/Knowledge) to interact with the Visual Analytics System and User.
The framework autonomously perceives user needs from VA interaction logs, provides tailored suggestions, and offers intuitive guidance through interactive system exploration.
This approach aims to enhance human-AI collaboration by addressing limitations of reactive AI assistants, ensuring timely, interpretable, and controllable support in dynamic analytical workflows.

Policy Disruption in Reinforcement Learning: Adversarial Attack with Large Language Models and Critical State Identification

ARCS (Adversarial Rewards and Critical State Identification): introduces an adaptive adversarial attack framework that leverages LLMs to generate tailored adversarial rewards and identifies critical states to disrupt victim RL policies.
The framework includes a Reward Iteration Optimization Module for LLM-guided reward generation and a Critical State Identification Mechanism for fine-tuning attacks on high-impact decision points.
This approach enables black-box adversarial attacks by guiding an attacker policy to induce suboptimal actions in a victim RL agent without direct environment or policy manipulation.

TELEVAL: A Dynamic Benchmark Designed for Spoken Language Models in Chinese Interactive Scenarios

TELEVAL (Dynamic Benchmark for Spoken Language Models): introduces a dynamic benchmark for evaluating Spoken Language Models (SLMs) in Chinese interactive scenarios, with Explicit Semantics (linguistic content understanding/response), Paralinguistic and Implicit Semantics (acoustic cues/implicit intentions), and System Capabilities (system-level performance) components, designed to align evaluation protocols with real-world user interactions.
The benchmark defines three evaluation dimensions, focusing on SLMs' ability to extract implicit cues from user speech and respond appropriately without explicit instructions.
TELEVAL adopts a dialogue format consistent with real-world usage, evaluating text and audio outputs separately to provide a user-centered evaluation framework.

23rd July 2025

BetterCheck: Towards Safeguarding VLMS for Automotive Perception Systems

BetterCheck (adapted SelfCheckGPT): introduces a framework for safeguarding Vision Language Models (VLMs) in automotive perception systems, utilizing a Curated Dataset, Object Labels, LLM Models (Captioners), Caption Generation, Sentence Decomposition, Human Annotators, LLM Models (Checkers), BetterCheck Results, and Data Analysis to detect VLM hallucinations.
The framework systematically assesses the performance of state-of-the-art VLMs (GPT-4o, LLaVA, MiniCPM-V) in captioning real-world automotive video footage from the Waymo Open Dataset.
It evaluates VLM capabilities in identifying and overlooking traffic agents, and their ability to self-check generated captions for consistency and correctness.

Symbiotic Agents: A Novel Paradigm for Trustworthy AGI-driven Networks

Symbiotic Agents: introduces a novel paradigm combining Large Language Models (LLMs) with real-time optimization algorithms, where LLMs (central decision-making) interpret high-level intents and supervise optimizers (input pre-processor, output controller) to enable trustworthy, adaptive, and real-time control in AGI-driven networks.
The framework implements two agent designs: Type I agents for dynamic Radio Access Network (RAN) control and Type II agents for multi-agent Service-Level Agreement (SLA) negotiations, both leveraging the symbiotic relationship between LLMs and optimizers.
This symbiotic design significantly reduces decision errors, improves accuracy, and enables the use of smaller language models (SLMs) with substantially lower overhead, bridging the gap towards trustworthy Artificial General Intelligence (AGI) in network management.

DynaSearcher: Dynamic Knowledge Graph Augmented Search Agent via Multi-Reward Reinforcement Learning

DynaSearcher (Dynamic Knowledge Graph Augmented Search Agent via Multi-Reward Reinforcement Learning): introduces a search agent that leverages dynamic knowledge graphs and multi-reward reinforcement learning, including a Policy LLM, Search Tool List, Doc Search Tool, KG Search Tool, Tools Module, External Environment, Multi-Reward Reinforcement Learning Framework (with Gain Reward, Penalty Reward, and Accuracy Reward), Iterative Reasoning-Retrieval Loop, and Answer Generation, to guide multi-step reasoning and generate precise answers.
The framework integrates structured knowledge graphs to ensure factual consistency in intermediate queries and employs a multi-reward RL mechanism for fine-grained control over retrieval accuracy, efficiency, and response quality.
This approach mitigates reasoning deviations from irrelevant information and promotes efficient reasoning paths, leading to state-of-the-art performance in complex multi-hop question answering tasks.

Compliance Brain Assistant: Conversational Agentic AI for Assisting Compliance Tasks in Enterprise Environments

CBA (Compliance Brain Assistant): introduces a conversational, agentic AI assistant designed to boost compliance task efficiency, featuring a Router LLM that directs queries to either a FASTTRACK flow for simple requests or a FULLAGENTIC flow for complex tasks, supported by various LLM-based components, tools, and memory.
The system intelligently chooses between a low-latency FASTTRACK path for context retrieval and a multi-step FULLAGENTIC path for complex reasoning and tool invocation.
Experimental evaluations demonstrate that CBA substantially improves performance over vanilla LLMs in terms of keyword match rate and LLM-judge pass rate for compliance-related queries.

Leveraging Knowledge Graphs and LLM Reasoning to Identify Operational Bottlenecks for Warehouse Planning Assistance

LLM Reasoning Agent: introduces a novel framework for identifying operational bottlenecks in warehouse planning, integrating Knowledge Graphs (KGs) and Large Language Models (LLMs) through a dual-path architecture that includes query classification, iterative reasoning, and self-reflection mechanisms.
The framework transforms raw Discrete Event Simulation (DES) output data into a semantically rich Knowledge Graph, enabling LLM-based agents to interpret natural language questions by generating sequential, conditioned sub-questions and precise Cypher queries.
This approach aims to bridge the gap between simulation modeling and advanced AI-driven data analysis, offering an intuitive method for extracting actionable insights and reducing time-to-insight for industrial data analysis.

Agent Identity Evals: Measuring Agentic Identity

AIE (Agent Identity Evals): introduces a rigorous, statistically-driven, empirical framework for measuring LMA identity stability over time, including capabilities, properties, and recovery from state perturbations.
The framework utilizes various LLMs for generating agent profiles, planning tasks, evaluating identity metrics, supervising planning, and injecting distractions.
It integrates memory modules, tool APIs, and an embedding model to assess how these scaffolding solutions mitigate LLM pathologies affecting agent identity.

LLM Meets the Sky: Heuristic Multi-Agent Reinforcement Learning for Secure Heterogeneous UAV Networks

LLM-HeMARL-S2DC introduces a hierarchical optimization framework that integrates LLM-generated expert policies into multi-agent reinforcement learning for UAV trajectory optimization, coupled with an S2DC algorithm for secure precoding, to maximize secrecy rate and minimize propulsion energy in heterogeneous UAV networks.
The framework addresses the complex trade-off between communication secrecy and energy efficiency by decoupling the problem into an outer-layer LLM-HeMARL for trajectory design and an inner-layer S2DC for precoding optimization.
This approach leverages LLM's heuristic guidance to accelerate learning and improve stability for UAV agents, enabling energy-aware, security-driven trajectories without real-time LLM inference overhead.

CogDual: Enhancing Dual Cognition of LLMs via Reinforcement Learning with Implicit Rule-Based Rewards

CogDual: introduces a novel Role-Playing Language Agent (RPLA) that adopts a cognize-then-respond reasoning paradigm, integrating a Large Language Model (LLM) with Dual Cognition, which encompasses Situational Awareness (SA) and Self-Awareness (SAself), and is optimized through a two-stage training framework involving Supervised Fine-tuning (SFT) and Reinforcement Learning (RL) with implicit rule-based rewards.
The Dual Cognition component enables the LLM to first process external environmental and social cues via Situational Awareness, then reflect on internal states and intentions through Self-Awareness, before generating a contextually relevant and psychologically consistent response.
The RL stage further enhances performance using two general-purpose reward schemes, Inference-Conditioned Likelihood Gain (ICLG) and Latent Semantic Alignment (LSA), which promote causal consistency and semantic fidelity in text generation, respectively, optimized via Grouped Reward Policy Optimization (GRPO).

Resilient Multi-Agent Negotiation for Medical Supply Chains: Integrating LLMs and Blockchain for Transparent Coordination

Hybrid Framework: introduces a novel system for medical supply chain coordination, integrating blockchain technology with an LLM-powered multi-agent negotiation system, comprising an Off-Chain Decision Layer (adaptive decision-making), a Cross-Layer Communication Protocol (bridges off-chain/on-chain), and an On-Chain Execution Layer (verifiable enforcement/auditability).
The Off-Chain Decision Layer utilizes LLM-powered agents (Manufacturer, Distributor, Hospital) equipped with reasoning tools for dynamic, context-sensitive resource allocation and negotiation.
The On-Chain Execution Layer, powered by smart contracts, ensures immutable, transparent, and auditable enforcement of decisions, enhancing resilience and accountability in crisis scenarios.

Enabling Self-Improving Agents to Learn at Test Time With Human-In-The-Loop Guidance

ARIA (Adaptive Reflective Interactive Agent): introduces a framework for LLM agents to continuously learn updated domain knowledge at test time, featuring an LLM Agent (internal reasoning, task execution), Intelligent Guidance Solicitation (self-reflection, query formulation), a Human Expert Oracle (provides guidance, corrections), Human-Guided Knowledge Adaptation (integrates human feedback, updates KR), and a Knowledge Repository (timestamped, structured knowledge base).
The framework assesses its own uncertainty through structured self-dialogue, proactively identifies knowledge gaps, and requests targeted explanations or corrections from human experts.
The system systematically updates an internal, timestamped knowledge repository with provided human guidance, detecting and resolving conflicting or outdated knowledge through comparisons and clarification queries.

Agent WARPP: Workflow Adherence via Runtime Parallel Personalization

WARPP (Workflow Adherence via Runtime Parallel Personalization): introduces a training-free, modular framework that combines multi-agent orchestration with runtime personalization to improve workflow adherence in LLM-based systems, featuring an Orchestrator Agent, Authenticator Agent, Personalizer, and Fulfillment Agent, supported by LLM as Client, LLM Agent, and LLM as a Judge, utilizing Client Info Tools, Full Routine, Client Data, Personalized Instructions, Trimmed Instructions + APIs, and various APIs/Tools.
The framework dynamically prunes conditional branches based on user attributes, reducing reasoning overhead and narrowing tool selection at runtime.
WARPP deploys a parallelized architecture where a dedicated Personalizer agent operates alongside modular, domain-specific agents to dynamically tailor execution paths in real time.

I2I - STRADA – INFORMATION TO INSIGHTS VIA STRUCTURED REASONING AGENT FOR DATA ANALYSIS

I2I-STRADA (Information-to-Insight via Structured Reasoning Agent for Data Analysis): introduces an agentic architecture designed to formalize the data analysis reasoning process, with Goal construction (infers user analytical goal), Contextual reasoner (grounds analysis with context), Workflow scaffolding (generates global action plan), Adaptive planning and executor (iteratively refines execution plans), Context aware tool creation (dynamically creates data processing tools), Dynamic state handler (maintains agent's working memory), and Communication handler (manages results presentation).
This framework models how data analysis unfolds via modular sub-tasks that reflect cognitive steps, ensuring structured reasoning and planning coherence.
Evaluations on DABstep and DABench benchmarks demonstrate its superior performance in planning quality and insight alignment compared to prior systems.

22nd July 2025

ThinkAct: Vision-Language-Action Reasoning via Reinforced Visual Latent Planning

ThinkAct: introduces a dual-system framework that bridges high-level reasoning with low-level action execution via reinforced visual latent planning, training a multimodal LLM to generate embodied reasoning plans guided by action-aligned visual rewards.
The framework's Reasoning MLLM generates reasoning plans and a visual plan latent, which then conditions the Action Model for robust action execution, enabling asynchronous operation for slow thinking and fast control.
ThinkAct leverages action-aligned visual feedback, including goal completion and trajectory consistency, to reinforce reasoning, leading to capabilities like few-shot adaptation, long-horizon planning, and self-correction in complex embodied AI tasks.

LingBench++: A Linguistically-Informed Benchmark and Reasoning Framework for Multi-Step and Cross-Cultural Inference with LLMs

LingBench++ (A Linguistically-Informed Benchmark and Reasoning Framework): introduces a Multi-Agent Framework for solving linguistic problems, which includes Solver Agents (proposes initial linguistic hypotheses), Aggregator Agents (collects, synthesizes solutions), a Final Aggregator (generates final solution), and a Grammar Agent (retrieves linguistic reference knowledge).
This multi-round framework enhances LLM reasoning by enabling iterative hypothesis generation, solution aggregation, and external knowledge retrieval for complex linguistic tasks.
The framework emphasizes stepwise reasoning quality and grammar-informed verification, providing diagnostic insights beyond final answer accuracy.

Agentar-Fin-R1: Enhancing Financial Intelligence through Domain Expertise, Training Efficiency, and Advanced Reasoning

Agentar-Fin-R1: introduces a family of financial LLMs, engineered based on the Qwen3 foundation model, with a development pipeline that includes a Data Pipeline (constructs high-quality data), a Label System (structures data synthesis), and a Training Pipeline (optimizes LLM performance).
The Data Pipeline integrates source governance, multi-agent data synthesis, and rigorous verification to ensure data quality and domain relevance for financial applications.
The Training Pipeline employs a weighted training framework and a two-stage strategy for efficient knowledge injection and challenge enhancement, complemented by an attribution loop for continuous model refinement.

Test-Time-Matching: Decouple Personality, Memory, and Linguistic Style in LLM-based Role-Playing Language Agent

TTM (Test-Time-Matching): introduces a training-free role-playing framework that automatically decouples character features into personality, memory, and linguistic style, utilizing a structured three-stage generation pipeline for controlled role-playing.
The framework's pipeline includes a Styleless Response Generation stage, a Memory-checked Response Generation stage, and a Stylized Response Generation stage, ensuring high-fidelity and stylistically consistent character dialogues.
TTM enhances controllability and personalization in role-playing language agents by enabling seamless combinations across diverse linguistic styles and variations in personality and memory.

DELIBERATIVE SEARCHER: IMPROVING LLM RELIABILITY VIA REINFORCEMENT LEARNING WITH CONSTRAINTS

Deliberative Searcher: introduces a reasoning-primary, information-secondary framework that integrates LLM deliberation with selective web search and confidence calibration, trained via a constrained reinforcement learning algorithm to align confidence with correctness.
The framework enables an agent to perform multi-step reflection and verification over external data, dynamically updating its confidence metrics through actions like THINK, SEARCH, and READ.
It optimizes for accuracy under a soft reliability constraint, utilizing a reward signal composed of format compliance, answer correctness, and reliability rewards to produce trustworthy outputs.

RAVine: Reality-Aligned Evaluation for Agentic Search

RAVine (Reality-Aligned Evaluation for Agentic Search): introduces a comprehensive evaluation framework for agentic LLMs with search, addressing misalignments in existing methods by targeting multi-point queries and long-form answers, and evaluating the iterative process.
The framework includes an Agentic LLM with Search, a Web Corpus, Search and Fetch Tools, an Attributable Nuggets Collection for fine-grained ground truth, and both Block-level and Process-Oriented Evaluations.
It provides a full-process, reproducible, and goal-aligned evaluation sandbox that assesses report quality, tool performance, and efficiency, offering insights into agentic search system development.

Agentic RAG with Knowledge Graphs for Complex Multi-Hop Reasoning in Real-World Applications

INRAExplorer (Agentic RAG system): introduces an agentic RAG system with an LLM-based agent, a hybrid knowledge base (Vector Database and Knowledge Graph), and specialized tools (SearchGraph, SearchPublications, SearchConceptsKeywords, IdentifyExperts) for complex multi-hop reasoning in scientific data.
The system empowers its LLM-based agent to dynamically navigate tools, gather evidence, and plan subsequent steps, enabling multi-hop reasoning and comprehensive answer generation from scientific data.
This approach overcomes classical RAG limitations by deeply integrating knowledge graph querying as a core agentic capability, enabling precise, relationally-aware retrieval and adaptive multi-hop reasoning.

Towards Enforcing Company Policy Adherence in Agentic Workflows

Framework for Enforcing Business Policy Adherence in Agentic Workflows: introduces a deterministic, transparent, and modular framework with an Offline Buildtime Stage that compiles Policy Documents, Toolkit, and Data Schema into verifiable ToolGuards (Python code) via a Tool-Policy Mapper and ToolGuard Generator, and a Runtime Integration where these ToolGuards ensure compliance before each LLM Agent action within a ReAct Workflow, preventing non-compliant Tool Invocations for the Customer.
The framework's buildtime phase leverages an LLM-based Tool-Policy Mapper to transform natural language policies into a Compact Tool-Oriented Policy Representation, which then feeds into an LLM-based ToolGuard Generator to produce executable Python ToolGuards.
This approach aims to bridge the gap between flexible AI behavior and organizational constraints by proactively preventing policy violations in LLM-based agentic workflows, ensuring reliable and predictable enterprise-scale operations.

LLM-Driven Collaborative Model for Untangling Commits via Explicit and Implicit Dependency Reasoning

ColaUntangle (collaborative consultation framework for commit untangling): introduces a multi-agent LLM-driven system for untangling commits by reasoning about explicit and implicit code dependencies, with all its components, where the system integrates structured code change information with specialized LLM agents in an iterative consultation process.
It leverages Structured Code Change Information, including explicit and implicit contexts derived from multi-version Program Dependency Graphs, to inform its Multi-Agent Architecture comprising Explicit Worker, Implicit Worker, and Reviewer LLM agents.
The Untangling Workflow orchestrates the iterative collaborative consultation among these agents to achieve consensus on untangling decisions and provide explanations for improved transparency.

Application of LLM Guided Reinforcement Learning in Formation Control with Collision Avoidance

LLM-FCCA (LLM-Guided Formation Control with Collision Avoidance): introduces a framework that leverages LLMs to dynamically generate and refine reward functions for multi-agent formation control with collision avoidance, utilizing an LLM Reward Designer, RL Training, Evaluation, Policy, Real World Deployment, Environment, Task Description and Tips, and Agent Observations Format (Local State, Obstacles State, Communication Data).
The framework dynamically adjusts reward functions online using advanced evaluation metrics, enabling efficient simultaneous achievement of formation control and obstacle avoidance.
Empirical studies in both simulation and real-world settings validate the approach's practicality and effectiveness, demonstrating superior performance with fewer iterations compared to human-designed methods.

Voice-based AI Agents: Filling the Economic Gaps in Digital Health Delivery

Agent PULSE (Patient Understanding and Liaison Support Engine): introduces a voice-based AI agent for digital health delivery, integrating a Voice Interface (standard telephone lines), an AI Engine (core intelligence platform) with LLMs (inference, fine-tuning service), SOLOMON (conversation management, analysis), and RAG (combines questionnaire, medical knowledge), and a Physician Dashboard (healthcare provider interface).
This system aims to bridge economic and accessibility gaps in healthcare by providing scalable, cost-effective, and equitable solutions for preventive care and continuous patient monitoring.
A pilot study with 33 inflammatory bowel disease patients demonstrated high patient acceptance and significant workflow advantages for healthcare providers, validating its potential to fill care gaps.

RealBench: Benchmarking Verilog Generation Models with Real-World IP Designs

Self-reflection agent: introduces a framework for Verilog generation, integrating an LLM with Design Specification input and Feedbacks from Syntax Checker, Testbench Verification, and Formal Verification for iterative code modification.
This agent iteratively refines generated Verilog code by leveraging verification feedback to improve correctness and reliability.
The iterative feedback loop aims to address syntax errors, functional errors, and formal verification failures, enhancing LLM-generated hardware designs.

Do Large Language Models Have a Planning Theory of Mind? Evidence from MINDGAMES: a Multi-Step Persuasion Task

MINDGAMES (Planning Theory of Mind Task): introduces a novel task framework for evaluating LLMs' ability to dynamically plan actions and strategically intervene on others' mental states, featuring a Persuader Agent (human or LLM participant), a Target Agent (hard-coded rational bot), a Dialogue Environment (multi-turn conversational interface), Proposals (three selectable options), Value Functions (agent preference definitions), Information Sets (agent knowledge states), and Mental States (target's beliefs and desires).
This framework assesses "planning theory of mind" (PToM) by requiring the persuader to infer the target's beliefs and desires to persuade them to alter their behavior, moving beyond passive ToM assessments.
The task involves the persuader selectively disclosing information to the target, who has partial information and makes rational choices based on its value function, highlighting a capability gap between human and LLM social reasoning.

Benchmarking LLM Privacy Recognition for Social Robot Decision Making

LLM Privacy Recognition Benchmark: introduces a methodology to evaluate LLMs' privacy awareness in social robot interactions, encompassing scenario generation, human preference elicitation, LLM evaluation with various prompting strategies, and subsequent analysis.
The benchmark leverages the Contextual Integrity framework to create privacy-relevant scenarios and crowdsourced human data to establish preferred robot behaviors and user privacy orientations.
It assesses LLM conformity to human privacy expectations, identifies the impact of different prompting strategies, and provides insights for designing privacy-aware LLM-powered social robots.

LLM Economist: Large Population Models and Mechanism Design in Multi-Agent Generative Simulacra

LLM Economist: introduces a novel framework for designing and assessing economic policies using agent-based modeling in strategic environments with hierarchical decision-making, featuring persona-conditioned worker agents and a planner agent optimizing tax schedules via in-context reinforcement learning.
The framework simulates a Stackelberg game where worker agents choose labor supply to maximize utility, and a planner agent proposes tax schedules to maximize social welfare, all within a language-driven environment.
This approach enables credible fiscal experimentation by optimizing heterogeneous utilities, generating demographically realistic agent populations, and performing natural language-based mechanism design.

Screen2AX: Vision-Based Approach for Automatic macOS Accessibility Generation

Screen2AX: introduces a vision-based pipeline for automatic macOS accessibility generation, processing UI screenshots through UI element detection, text detection, element description, and hierarchy generation to produce structured hierarchical accessibility metadata.
The framework leverages YOLOv11 for element localization, classification, and hierarchical grouping, and BLIP for generating semantic descriptions of UI elements.
The system aims to bridge the gap in macOS accessibility support by creating real-time, tree-structured accessibility metadata from single screenshots, outperforming built-in tools in quality.

Augmenting Von Neumann's Architecture for an Intelligent Future

Augmented Von Neumann Architecture: introduces a novel computer architecture that extends the classical Von Neumann model with a dedicated Reasoning Unit (RU) for native artificial general intelligence capabilities, alongside the Central Processing Unit (CPU), Arithmetic Logic Unit (ALU), Memory Subsystem, Control Unit, Input/Output System, and a Semantic Interconnect Bus (SIB).
This architecture enables autonomous agents to perform goal-directed planning, dynamic knowledge manipulation, and introspective reasoning directly within the computational substrate at system scale.
The framework establishes a computational foundation where reasoning, learning, and adaptation emerge as intrinsic execution properties, moving beyond traditional sequential computation.

Distributed Oscillatory Guidance for Formation Flight of Fixed-Wing Drones

Distributed Oscillatory Guidance: introduces a novel approach for fixed-wing drone formation flight by modulating path progression through a non-negative input-saturated consensus strategy, integrating an Inverse Kinematics Guiding Vector Field (IK-GVF) path-following controller, and leveraging fixed-wing drone dynamics.
This method enables coordinated formation flight without requiring speed actuation, achieving synchronized path following by inducing controlled oscillations in the guiding vector field.
The approach ensures robust convergence to desired formations even with speed fluctuations, validated through numerical simulations and real-world flight experiments.

From model-based learning to model-free behaviour with Meta-Interpretive Learning

MIL-M2MF (Meta-Interpretive Learning for Model-Based to Model-Free Behavior): introduces a framework that uses a MIL System to learn a Model-based Solver, which then generates examples to train a Model-free Controller, enabling autonomous agents to combine planning and exploration capabilities in novel environments.
The Model-based Solver plans actions with full environment knowledge, while the Model-free Controller acts without a model, relying on learned state-action mappings.
The framework demonstrates the equivalence in problem-solving ability between the learned Solver and Controller on grid navigation tasks, utilizing specialized FSC Executors and a Grid Master environment.

VL-CLIP: Enhancing Multimodal Recommendations via Visual Grounding and LLM-Augmented CLIP Embeddings

VL-CLIP (Visual Grounding and LLM-Augmented CLIP Embeddings): introduces a novel framework that enhances CLIP embeddings by integrating Visual Grounding (localizes product regions) and an LLM-based Agent (enriches text descriptions) to improve multimodal recommendations.
The framework refines image representations via Grounding DINO and enhances textual features through an iterative LLM process involving a Summarizer, Evaluator, and Refiner, before being processed by CLIP's dual encoders and optimized with contrastive loss.
Deployed on a large e-commerce platform, the framework significantly improves retrieval accuracy, multimodal retrieval effectiveness, and recommendation quality, demonstrating the practical efficacy of combining object-aware visual grounding and LLM-enhanced text representation.

Parallelism Meets Adaptiveness: Scalable Documents Understanding in Multi-Agent LLM Systems

ACF (Adaptive Coordination Framework): introduces a coordination framework for multi-agent LLM systems, with Orchestrator (Coordinates task execution), Dynamic Task Routing (Reassigns tasks dynamically), Role Self-Optimization (Agents adapt roles), Shared Long-Term Memory (Persistent document store), Role Agents (Specialized task agents), Evaluator Agent (Scores and selects outputs), Feedback Bus (Facilitates inter-agent communication), and Parallel Agents (Multiple agents for competition), designed for scalable document understanding.
The framework enhances robustness and accuracy in complex financial document analysis by integrating dynamic task routing, bidirectional feedback, and competitive parallel agent evaluation.
This system improves factual coverage, coherence, and efficiency over static and partially adaptive baselines, demonstrating the benefits of adaptiveness and structured competition in multi-agent LLM systems.

Can External Validation Tools Improve Annotation Quality for LLM-as-a-Judge?

Evaluation Agent: introduces a tool-using agentic system to provide higher quality feedback on long-form factual, advanced coding, and math tasks, by augmenting LLM-as-a-Judge with external validation tools, including Model Responses (input for evaluation), Initial Domain Assessment (LLM selects tools), Tool Usage (orchestrates external validation), Fact Check Tool (verifies factual statements), Code Execution Tool (executes, verifies code), Math Check Tool (validates math calculations), Provide Collected Information (aggregates tool outputs), Final Decision (LLM makes judgment), Judgement (final preference output), and Baseline Annotator (fallback evaluation system).
The system leverages an Initial Domain Assessment to select relevant tools like Fact Check, Code Execution, and Math Check, then uses a Final Decision component to make judgments based on tool outputs, reverting to a Baseline Annotator if no tools are useful.
This framework aims to improve AI annotator performance by grounding evaluations in external validation, reducing reliance on LLM's internal knowledge and biases.

AURA: A Multi-Modal Medical Agent for Understanding, Reasoning & Annotation

AURA (A Multi-Modal Medical Agent for Understanding, Reasoning & Annotation): introduces, "AURA is an agentic AI system for comprehensive analysis, explanation, and evaluation of medical images", with AURA Agent (Multi-modal medical agent), LLM Head (Core reasoning engine), User Inquiry Interface (Input handling), ReAct-style Reasoning Loop (Orchestrates thought-action-observation), Memory (Stores states and results), Tool Orchestration Module (Manages tool execution), Specialized Tools (Modular medical utilities), Visual Question Answering Tool (Radiology dialogue/reporting), Medical Image Segmentation Tool (Localizes clinical regions), Counterfactual Image Generation Tool (Generates explanatory images), Self Evaluation & Analysis Tool (Assesses diagnostic relevance), Grounded Report Generation Tool (Aligns findings visually), Counterfactual Editing Tools (Precision image editing), Segmentation and Detection Tools (Anatomy/pathology classification), Analysis and Visualization Tools (Quantifies edits/visualizes), where it leverages an LLM-based architecture and a modular toolbox to provide interpretable, multimodal visual-linguistic explanations for medical imaging.
The system emphasizes dynamic visual-linguistic explanations, introspective evaluation, and adaptive reasoning, enabling it to operate effectively even with limited pathological knowledge.
Its modular design and ReAct-style reasoning loop allow autonomous self-assessment, tool orchestration, and generation of high-quality, clinically relevant outputs for chest X-ray analysis.

Towards Simulating Social Influence Dynamics with LLM-based Multi-agents

Forum Simulation Framework: introduces an LLM-based multi-agent conversational environment designed to simulate social influence dynamics, featuring Dialogue Orchestration, LLM-based Multi-agents with defined Agent Personas, and Evaluation Metrics.
The framework orchestrates asynchronous text-based discussions over five rounds, allowing LLM-based agents to adjust stances based on peer input and a shared conversation log.
It systematically investigates how varying LLM capacities and architectures influence simulated social interactions, quantifying conformity, polarization, and fragmentation.

From Cloud-Native to Trust-Native: A Protocol for Verifiable Multi-Agent Systems

TrustTrack (Trust-Native Protocol Stack): introduces a protocol stack for verifiable multi-agent systems, with an Agent Layer (agent execution environment), a Protocol Layer (protocol management), and a Ledger Layer (immutable data storage).
TrustTrack reframes compliance as a design constraint by embedding structural guarantees like verifiable identity, policy commitments, and tamper-resistant behavioral logs directly into agent infrastructure.
The protocol enables cryptographic traceability of agent behavior, supporting verifiable provenance and accountability in high-stakes, multi-agent workflows.

21st July 2025

LLM Economist: Large Population Models and Mechanism Design in Multi-Agent Generative Simulacra

LLM Economist: introduces a novel framework for agent-based economic modeling, featuring a Tax Planner (LLM agent) that designs tax policies and Worker Agents (LLM agents) that adjust labor, all interacting within an Environment that simulates economic outcomes.
The framework employs in-context reinforcement learning for both the planner and workers, enabling them to adapt to strategic environments with hierarchical decision-making and optimize their respective utility functions.
It uniquely integrates census-calibrated population modeling, dynamic tax-mechanism optimization, and democratic governance, providing a testbed for fiscal policy evaluation at a societal scale.

Towards physician-centered oversight of conversational diagnostic AI

g-AMIE (guardrailed-AMIE): introduces a novel asynchronous oversight paradigm for conversational diagnostic AI, enabling AI-driven patient intake with strict guardrails and subsequent human physician oversight via a dedicated clinician cockpit.
This framework decouples AI-driven patient intake from medical advice delivery, mandating human oversight by licensed primary care physicians to ensure safety and accountability.
A randomized, blinded virtual Objective Structured Clinical Examination (OSCE) study demonstrated g-AMIE's superior performance in high-quality intake and case summarization compared to human control groups.

A Framework for Analyzing Abnormal Emergence in Service Ecosystems Through LLM-based Agent Intention Mining

EAMI: introduces a framework for analyzing abnormal emergence in service ecosystems, integrating Agent-Based Modeling, Inspector Agent, Analysis Agent, Intention Repository, Embedding Module, Clustering Module, and Intention Temporal Emergence Diagram to bridge microscopic agent intentions with macroscopic service emergence.
The framework employs LLMs within its Inspector and Analysis Agents, along with a Memory component and Dual-Perspective Thought Extraction, to track and extract agent thoughts, enabling dynamic and interpretable emergence analysis.
It identifies phase transition points in group intentions through embedding and clustering, then visualizes their temporal evolution to explain complex system phenomena.

GasAgent: A Multi-Agent Framework for Automated Gas Optimization in Smart Contracts

GasAgent: introduces a multi-agent framework for automated Gas optimization in smart contracts, including Seeker (identifies known patterns), Innovator (discovers new patterns), Executor (applies and validates changes), Manager (orchestrates workflow and reports), Gas Waste Pattern Library (stores known patterns), and New Pattern Blacklist (filters invalid patterns), designed to combine compatibility with existing patterns and automated discovery/validation of new patterns for end-to-end optimization.
The framework addresses limitations of manual Gas optimization and single LLM approaches by enabling specialized agents to collaborate in a closed loop for identifying, validating, and applying Gas-saving improvements.
GasAgent demonstrates effectiveness by optimizing real-world contracts with an average deployment Gas saving of 9.97% and usability for LLM-generated contracts, serving as a reliable optimization layer.

BUGSCOPE: LEARN TO FIND BUGS LIKE HUMAN

BUGSCOPE (BugScope: Learn to Find Bugs Like Human): introduces an LLM-driven multi-agent system that emulates human auditors' workflow, including a Context Retrieval Agent (retrieves relevant code context) and a Bug Detection Agent (detects and validates bugs), which together automate the end-to-end auditing process.
The Context Retrieval Agent utilizes Retrieval Strategy Synthesis with a Seed Extractor and Retrieval Direction, and performs Slicing-Based Context Retrieval with an AST Parser and LLM to gather relevant code snippets.
The Bug Detection Agent synthesizes a detection prompt using an LLM, Reasoning Hints, and Prompt Reflection, then employs the LLM for Bug Validation to generate structured Bug Reports, effectively generalizing across diverse anti-patterns.

DHEvo: Data-Algorithm Based Heuristic Evolution for Generalizable MILP Solving

DHEvo (Data-Algorithm Based Heuristic Evolution): introduces a data-algorithm co-evolution framework that iteratively selects representative MILP instances and evolves corresponding heuristics, with all Initialization-, Sample-, Iterative evolution-, Final Selection-components, where it significantly improves the generalization ability of generated heuristics for Mixed-Integer Linear Programming (MILP) solving.
The framework employs an LLM-based MA-Evolution System, including Designer, Coder, Reviewer, and Judger agents, to generate and refine data-code pairs simultaneously through a debate cycle.
This co-evolutionary approach ensures mutual adaptation between instances and algorithms, leading to robust generalization and superior performance compared to human-designed and existing LLM-based methods.

PHYSGYM: Benchmarking LLMs in Interactive Physics Discovery with Controlled Priors

PHYSGYM (Benchmarking LLMs in Interactive Physics Discovery with Controlled Priors): introduces a novel benchmark suite and simulation platform for assessing LLM-based scientific reasoning, including an Environment (simulated physics problems), Interface (controls experiments and data), and Evaluator (assesses model performance), designed to systematically control task complexity and prior knowledge for interactive physics discovery.
The platform enables agents to actively probe environments, gather sequential data under constraints, and formulate hypotheses about underlying physical laws, providing fine-grained control over prior knowledge levels to dissect agent performance.
PHYSGYM offers standardized evaluation protocols and metrics for hypothesis accuracy and model fidelity, demonstrating its utility in differentiating LLM capabilities based on varying priors and task complexity.

HAMLET: Hyperadaptive Agent-based Modeling for Live Embodied Theatrics

HAMLET (Hyperadaptive Agent-based Modeling for Live Embodied Theatrics): introduces a multi-agent framework for AI drama, with Actor Designer (Generates character profiles), Plot Designer (Composes narrative draft/scenes/props), Reviewer (Evaluates character/plot rationality), Director (Integrates profiles, creates blueprint), Narrative Blueprint (Structured guide for performance), Planner (Designs/reviews multi-trajectory beats), Transfer (Monitors flag fulfillment, advances plot), Advancer (Ensures plot progression, directs actors), Actor (Performs narrative, makes decisions), Perceive And Decide (PAD) Module (Guides actor strategic decisions), Internal State (Actor's self-awareness, goals), External Stimulus (Environmental/contextual information), Tool Calling (Generates speech/actions), Narrator (Adjudicates interactions, updates environment), Critic (Evaluates drama performance quality), and Online Performance Environment (Dynamic, interactive theatrical setting), enabling autonomous and immersive interactive drama.
The framework operates in two stages: offline planning to generate a narrative blueprint from a simple topic, and online performance for dynamic, improvisational theatrical experiences.
It incorporates a comprehensive evaluation method, HAMLETJudge, to assess character performance, narrative quality, and interaction experience, achieving top-ranking results.

PhishIntentionLLM: Uncovering Phishing Website Intentions through Multi-Agent Retrieval-Augmented Generation

PhishIntentionLLM: introduces a multi-agent RAG framework that uncovers phishing intentions from website screenshots, employing a Vision Analysis Agent, Context Enrichment Agent, Primary Classification Agent, a Specialist Analysis Layer with dedicated expert agents, a Validation Agent, and a dual-layer Knowledge Base with a feedback loop.
The framework leverages LLMs' visual-language capabilities and a dual-layer knowledge architecture to provide scalable and interpretable intention-aware phishing analysis.
It significantly outperforms single-agent baselines and prior work in precision and recall for detecting credential theft, financial fraud, malware distribution, and personal information harvesting.

Butterfly Effects in Toolchains: A Comprehensive Analysis of Failed Parameter Filling in LLM Tool-Agent Systems

LLM Tool-Agent System Analysis Framework: introduces a comprehensive analysis of failed parameter filling in LLM tool-agent systems, utilizing a Failure Taxonomy Construction (Methodology component) and an Evaluation Process (Methodology component) to investigate "Butterfly Effects" in toolchains.
The paper systematically identifies five parameter failure patterns—Missing Information, Redundant Information, Hallucination Name, Task Deviation, and Specification Mismatch—and constructs a taxonomy using Grounded Theory.
It applies 15 input perturbation methods across user queries, tool documents, and tool returns to analyze their impact on LLM parameter behavior and proposes actionable improvements for tool agent reliability.

SPAR: Scholar Paper Retrieval with LLM-based Agents for Enhanced Academic Search

SPAR (Scholar Paper Retrieval): introduces a modular multi-agent framework for academic paper retrieval that leverages LLM-based agents and RefChain for enhanced search.
This framework performs fine-grained query understanding, multi-source retrieval, citation-driven knowledge expansion, and relevance-aware reranking to mirror human research exploration.
SPAR significantly outperforms strong baselines on AutoScholar and SPARBench, a new expert-annotated benchmark, demonstrating its robustness and generalization in complex academic search scenarios.

FaultLine: Automated Proof-of-Vulnerability Generation using LLM Agents

FAULTLINE: introduces an LLM agent workflow that automatically generates Proof-of-Vulnerability (PoV) test cases by tracing data flow, reasoning about control flow conditions, and iteratively refining tests based on execution feedback.
The framework leverages an LLM agent augmented with various tools to explore codebases, identify vulnerability sources and sinks, and derive precise input conditions required to trigger vulnerabilities.
The system's multi-stage reasoning process, including data flow analysis, control flow analysis, and a feedback-driven repair loop, enhances the LLM's ability to generate effective and accurate PoV tests across different programming languages.

Solving Formal Math Problems by Decomposition and Iterative Reflection

Delta Prover: introduces an agent-based framework that orchestrates a general-purpose LLM, Lean 4 Proof Environment, and Retrieval Model, with Reflective Decomposition, Iterative Proof Repair, Automatic Proof Consolidation, and a Domain-Specific Language (DSL), to solve formal math problems by iteratively refining proofs and decomposing complex theorems.
The framework leverages the LLM's inherent reasoning and reflection capabilities to interactively construct formal proofs in Lean 4, circumventing the need for model specialization or extensive fine-tuning.
It achieves state-of-the-art performance on the miniF2F-test benchmark by systematically tackling complex proofs, learning from mistakes, and producing machine-verifiable results.

EchoVoices: Preserving Generational Voices and Memories for Seniors and Children

EchoVoices: introduces an end-to-end digital human pipeline for seniors and children, with k-NN enhanced Whisper ASR model (speech recognition), LLM-driven Agent (persona distillation/response generation), Persona Card (user identity summary), RAG (memory retrieval), Memory Fragments (vector database), Age-adaptive VITS model (speech synthesis), Wav2Lip (lip synchronization), and GFPGAN (photorealistic face rendering), designed to create persistent digital personas by preserving unique voices and memories.
The system processes spoken queries from seniors or children, transcribes them using a k-NN augmented Whisper model, generates context-aware responses via an LLM-driven agent with a RAG-based memory, and synthesizes age-appropriate speech using a two-stage fine-tuned VITS model.
This framework aims to address the challenges of conventional ASR, TTS, and LLM systems with atypical speech patterns and interaction styles of seniors and children, enabling empathetic and effective intergenerational digital interactions.

PromptArmor: Simple yet Effective Prompt Injection Defenses

PromptArmor: introduces, "a simple yet effective defense against prompt injection attacks", with Guardrail LLM (off-the-shelf LLM), Prompting Strategy (carefully designed prompts), Detection (identifies injected prompts), Extraction (isolates malicious content), and Sanitization (removes injected prompts), where it functions as a guardrail layer to detect and remove malicious prompts from agent inputs before processing.
This defense leverages the text understanding and pattern recognition capabilities of an off-the-shelf LLM to analyze data samples and identify inconsistencies introduced by injected prompts.
PromptArmor operates as a standalone preprocessing component, ensuring minimal disruption to existing LLM-based systems and allowing the agent to complete its intended user task with sanitized data.

Mixture-of-Recursions: Learning Dynamic Recursive Depths for Adaptive Token-Level Computation

MoR (Mixture-of-Recursions): introduces a unified framework for LLMs that combines parameter sharing and adaptive computation, featuring a Recursive Transformer with Shared Stack of Layers/Recursion Block, a Router with Expert-choice routing/Token-choice routing, and a KV Caching Strategy with Recursion-wise KV caching/Recursive KV sharing.
This framework dynamically assigns token-level recursion depths via lightweight routers and selectively caches Key-Value pairs, focusing quadratic attention computation only on active tokens to improve memory access efficiency.
MoR establishes a new Pareto frontier for LLM efficiency, significantly lowering validation perplexity and improving few-shot accuracy while delivering higher throughput compared to existing baselines.

Red-Team Multi-Agent Reinforcement Learning for Emergency Braking Scenario

RMARL (Red-Team Multi-Agent Reinforcement Learning): introduces a framework where red-team agents, trained using a DC-GPPO algorithm with GCN and MLP, actively interfere with autonomous vehicles (AVs) in emergency braking scenarios, leveraging a CGMDP and PTZ model to generate high-risk corner cases.
The framework redefines background vehicles as red-team agents, enabling them to explore and uncover safety-critical scenarios beyond typical data distributions by maximizing AV collision rates while adhering to traffic regulations.
The PTZ model quantifies the threat posed by red-team vehicles, encouraging more extreme adversarial behaviors, and the DC-GPPO algorithm applies dual constraints to ensure realistic and disruptive interference.

The Constitutional Controller: Doubt-Calibrated Steering of Compliant Agents

CoCo (Constitutional Controller): introduces a novel framework for doubt-calibrated steering of compliant agents, integrating a Constitution (agent's structured knowledge base), a Doubt Model (neural self-doubt probability density), Probabilistic Inference, Plan & Control, and Online Compliance Validation.
The framework enhances agent safety and reliability by reasoning over deep probabilistic logic programs representing constraints and learning self-doubt from contextual features.
CoCo's adaptive behavior, demonstrated in UAV navigation, allows agents to account for external constraints and internal uncertainties, leading to compliant and crash-free operations.

The Emergence of Deep Reinforcement Learning for Path Planning

DQN (Deep Q-Network) Algorithm: is illustrated as a path planning model for marine search and rescue vessels, including an Environment, Actions, Estimation Q-network, Target Q-network, Reward Function, Experience Replay Memory, Gradients, Loss Function, and Update after N steps, designed to optimize navigation strategies.
This model enables autonomous agents to learn optimal navigation policies through interactive learning with the environment, aiming to maximize cumulative rewards for efficient search paths.
The architecture incorporates a target network for stable Q-value references and experience replay to decorrelate learning samples, enhancing the algorithm's stability and adaptability.

LaViPlan : Language-Guided Visual Path Planning with RLVR

LaViPlan (Language-Guided Visual Path Planning with Reinforcement Learning with Verifiable Rewards): introduces a framework that leverages Reinforcement Learning with Verifiable Rewards (RLVR) to optimize Vision-Language Models (VLMs) for autonomous driving, addressing vision-language-action misalignment by integrating a policy model, a reference model, and verifiable rewards.
The framework operates in two phases: supervised fine-tuning of a VLM, followed by reinforcement fine-tuning where the policy model is optimized using Group Relative Policy Optimization (GRPO) with rewards based on output format adherence and trajectory accuracy.
This approach aims to steer VLMs toward context-aware decision-making consistent with situational reasoning, improving performance in out-of-distribution scenarios by explicitly optimizing planning-oriented metrics.

Expert-Guided LLM Reasoning for Battery Discovery: From AI-Driven Hypothesis to Synthesis and Characterization

ChatBattery: introduces an expert-guided LLM reasoning platform for battery materials discovery, featuring two phases (Exploration and Exploitation), eight sequential stages, and seven specialized agents, including LLM, Domain, Search, Decision, Retrieval, Rank, and Human Agents.
This framework integrates domain knowledge to steer LLMs towards effective reasoning, enabling the identification, synthesis, and characterization of novel lithium-ion battery cathode materials.
The platform's AI-driven approach significantly reduces the time for experimental screening and validation, demonstrating the transformative potential of LLM-augmented discovery pipelines.

Deep Researcher with Test-Time Diffusion

TTD-DR (Test-Time Diffusion Deep Researcher): introduces a novel framework that conceptualizes research report generation as a diffusion process, iteratively refining a preliminary draft through denoising and self-evolution, leveraging LLM-powered agents for each stage.
The framework initiates with a noisy draft and a research plan, which are then refined via a continuous feedback loop incorporating external information through a retrieval mechanism.
This draft-centric design, enhanced by component-wise self-evolution, ensures timely and coherent report writing while minimizing information loss during the iterative search process.

Making REST APIs Agent-Ready: From OpenAPI to Model Context Protocol Servers for Tool-Augmented LLMs

AutoMCP (Automated Model Context Protocol Compiler): introduces a compiler that automates the generation of Model Context Protocol (MCP) servers from OpenAPI specifications, with components including Input Parsing and Dialect Resolution, Spec Normalization and Flattening, Authentication Analysis and .env Generation, Stub Generation and Handler Synthesis, and Output Layout and Transport Configuration.
The framework aims to streamline the integration of REST APIs into LLM workflows by transforming OpenAPI definitions into callable MCP tools, thereby reducing manual glue code and hardcoded prompts.
This approach addresses the engineering bottleneck of manually constructing MCP servers, enabling dynamic tool discovery and invocation for tool-augmented LLMs.

A Pilot Study on LLM-Based Agentic Translation from Android to iOS: Pitfalls and Insights

Multi-Agent Translation Pipeline: introduces an LLM-based agentic approach for mobile application translation from Android to iOS, with Specification Extraction, Code Translation, and Code Validation Agents, where it evaluates LLM performance, identifies key failure points, and proposes improvement guidelines.
The study evaluates the approach on five diverse Android projects, manually analyzing translated code for syntactic correctness, semantic accuracy, and functional completeness.
It identifies 10 types of translation failures across method, file, and package levels, underscoring challenges in platform-aware translation and the need for robust validation.

HyDRA: A Hybrid-Driven Reasoning Architecture for Verifiable Knowledge Graphs

HyDRA (Hybrid-Driven Reasoning Architecture): introduces a framework for verifiable Knowledge Graph (KG) automation, integrating symbolic knowledge and neural networks, with components including stakeholder, persona, scope document, competency question, ontology, and KG generation modules, all guided by verifiable contracts.
The architecture operationalizes Design-by-Contract principles and the SymbolicAI framework, orchestrating an LLM-driven pipeline with a closed-loop verification and repair mechanism that enforces structural invariants and type consistency.
This approach aims to improve the reliability of automated KG construction by ensuring traceability from high-level requirements to low-level data and providing an evaluation framework for functional correctness.

Towards Mitigation of Hallucination for LLM-empowered Agents: Progressive Generalization Bound Exploration and Watchdog Monitor

HalMit: introduces a novel black-box watchdog framework that models the generalization bound of LLM-empowered agents to detect hallucinations, without requiring internal knowledge of the LLM's architecture, with Core Agent (coordinates interactions), Query Generation Agent (generates queries), Target LLM (LLM-powered agent), Evaluation Agent (evaluates responses), Vector Database (stores generalization bound points), Policy Network (adjusts fractal probabilities), Probabilistic Fractal Sampling (query generation method), Generalization Bound Exploration (identifies generalization bound), and Watchdog Monitor (monitors hallucinations).
The framework employs a multi-agent system, including a Core Agent, Query Generation Agents, a Target LLM, and an Evaluation Agent, to explore and identify the generalization bound.
It utilizes probabilistic fractal sampling guided by a Policy Network to efficiently generate queries and store identified boundary points in a Vector Database for real-time hallucination monitoring.

Advancing Responsible Innovation in Agentic AI: A study of Ethical Frameworks for Household Automation

IAIDF: introduces a comprehensive methodology for designing ethical and inclusive agentic AI in household automation, encompassing ethical foundation, co-design with vulnerable groups, user agency and privacy control, socio-technical bias reduction, household dynamics simulation, ethical data mining, and real-world deployment and societal impact.
This framework emphasizes integrating ethical considerations from the initial design phase, ensuring user participation, maintaining user control over data and AI actions, and mitigating biases through socio-technical approaches.
The methodology utilizes simulations and real-world deployments to validate AI systems, aiming to foster trust, enhance usability, and ensure equitable outcomes for diverse users in smart home environments.

An LLM Driven Agent Framework for Automated Infrared Spectral Multi Task Reasoning

LLM Driven Agent Framework: introduces an end-to-end LLM-driven agent framework that integrates an Input Module (receives queries/spectral data), an LLM Agent (orchestrates tasks/performs reasoning) with Entity Extraction (identifies research object/task), Function Call (invokes spectral processing), Multi-task Generation (performs classification/regression/anomaly detection), and Multi-turn Generation Enhancement (refines predictions iteratively), a Structured Paper Database (curated IR publications knowledge), a Retrieval Algorithm (searches knowledge base), a Spectral Processing Module (applies preprocessing/feature extraction), a Hard Samples Module (identifies/feeds mispredicted samples), and an Output Module (provides analytical results).
The framework leverages few-shot learning and a multi-turn conversational protocol, where hard samples are iteratively appended to prompts, to dynamically refine predictions and improve performance under low-data conditions.
This approach combines domain-specific reasoning with generalizable inference capabilities, establishing a new paradigm for intelligent, scalable infrared spectral analysis.

20th July 2025

WebShaper: Agentically Data Synthesizing via Information-Seeking Formalization

WebShaper (Formalization-Driven IS Data Synthesis Framework): introduces a novel formalization-driven framework for synthesizing high-quality information-seeking (IS) training data, leveraging set-theoretic constructs and an agentic Expander module for systematic task generation and expansion.
This framework addresses data scarcity and inconsistency in IS agent development by formalizing tasks as Knowledge Projections, enabling precise control over reasoning structures and complexity.
WebShaper's approach, including seed task construction, agentic expansion with specialized tools, and robust training methodologies, yields state-of-the-art performance for open-sourced IS agents on benchmarks like GAIA and WebWalkerQA.

LibLMFuzz: LLM-Augmented Fuzz Target Generation for Black-box Libraries

LibLMFuzz: introduces a framework that pairs an agentic LLM with a lightweight toolchain (disassembler, compiler, fuzzer) within a sandbox, orchestrated by middleware.
This system autonomously analyzes stripped binaries, plans fuzz strategies, generates drivers, and iteratively self-repairs build or runtime errors for black-box libraries.
The framework significantly reduces costs associated with fuzzing closed-source libraries by achieving 100% API coverage with no human intervention.

EduThink4AI: Translating Educational Critical Thinking into Multi-Agent LLM Systems

EDU-Prompting: introduces a novel multi-agent framework for translating educational critical thinking into LLM systems, with Agent I (brainstorms initial answers), Agent II (validates answer existence), Agent III (critiques raw answers), Agent IV (synthesizes final answer), User Prompt Generator (collects user input), Stage Classifier (classifies learning stage), Vocabulary Module (processes vocabulary), Vocab Fetcher (identifies vocabulary terms), WordNet (enriches vocabulary data), Vocab Explainer (generates vocabulary explanations), Writing Assessor (evaluates writing content), Topic Module (analyzes user topics), Topic Identifier (identifies primary topics), Prompt Generator (creates topic prompts), Prompt Aggregator (synthesizes aggregated prompts), Reasoning Module (orchestrates critical thinking), and Final Response Generator (generates comprehensive response).
The framework significantly enhances content truthfulness and logical soundness in AI-generated educational responses by fostering diverse perspectives and analytical reasoning.
Its modular design allows seamless integration into existing educational applications, enabling practitioners to incorporate critical thinking catalysts and multiple perspectives without extensive system modifications.

LLM-Enhanced Multi-Agent Reinforcement Learning with Expert Workflow for Real-Time P2P Energy Trading

LLM-MARL: introduces an integrated framework for real-time P2P energy trading, with LLM Expert Workflow (Generates expert strategies), MARL (Learns optimal policies), and P2P Energy Trading Environment (Simulates energy market) components, designed to bridge expert knowledge with agent learning for efficient energy market decision-making.
The framework replaces human experts with LLMs to guide MARL agents through imitation learning, significantly reducing manual intervention costs and integrating expert knowledge.
It employs a novel multi-agent imitation learning algorithm with a Wasserstein metric and a differential multi-head attention-based Critic network to enhance policy evaluation and accelerate learning.

Byzantine-Robust Decentralized Coordination of LLM Agents

DecentLLMs: introduces a decentralized consensus approach for multi-agent LLM systems, including worker agents (generate answers in parallel), evaluator agents (score answers/aggregate scores/select best answer/record on blockchain/reply to user), Geometric Median (GM) Algorithm (Byzantine-robust score aggregation), Blockchain (record transactions/auditable records), and Byzantine Reliable Broadcast Protocols (ensure consistent message delivery), designed to overcome limitations of leader-driven coordination.
This framework enables faster consensus and consistently selects higher-quality answers by allowing worker agents to generate responses concurrently and evaluator agents to independently score and rank them using Byzantine-robust aggregation techniques.
The system effectively tolerates Byzantine agents and significantly improves the quality of selected answers compared to traditional leader-based quorum voting methods.

Redefining Elderly Care with Agentic AI: Challenges and Opportunities

Agentic AI (Agentic Artificial Intelligence): introduces a comprehensive review of LLM-powered Agentic AI's transformative potential in elderly care, covering its applications, challenges, and ethical considerations for personalized, autonomous support.
The paper details Agentic AI's applications in personalized health management, cognitive support, emotional companionship, and enabling independence and inclusivity for older adults.
It also critically examines associated challenges, including data privacy, reliability, and integration issues, proposing a human-centered framework for responsible and equitable deployment.

INSIGHTX AGENT: An LMM-based Agentic Framework with Integrated Tools for Reliable X-ray NDT Analysis

INSIGHTX AGENT: introduces an LLM-based agentic framework for reliable X-ray NDT analysis, with LMM Agent Core (orchestrates process), Large Language Model (reasoning, intent recognition), Lora Layer (domain adaptation), Image Encoder (visual feature processing), Tokenizer (text input processing), Sparse Deformable Multi-Scale Detector (SDMSD) (defect localization), CNN Backbone (extracts multi-scale features), Proposal Generation (generates, refines proposals), Deformable Attention Mechanisms (refines sparse proposals), Evidence-Grounded Reflection (EGR) Tool (validates, refines proposals), Context Assessment (evaluates image characteristics), Individual Defect Analysis (evaluates each proposal), False Positive Elimination (applies rejection criteria), Confidence Recalibration (adjusts confidence scores), and Quality Assurance (verifies output consistency).
The framework positions an LLM as a central orchestrator, coordinating specialized tools like SDMSD for defect detection and EGR for reflective validation, moving beyond passive data processing to active reasoning.
This approach enhances diagnostic reliability, interpretability, and interactivity in X-ray NDT by integrating high-precision detection with structured, evidence-grounded reasoning and self-assessment.

Manipulating LLM Web Agents with Indirect Prompt Injection Attack via HTML Accessibility Tree

Browser Gym Agent: introduces a system vulnerable to Indirect Prompt Injection (IPI) attacks, demonstrating how a malicious actor can manipulate its web navigation behavior by embedding adversarial triggers in webpage HTML.
The system leverages the Greedy Coordinate Gradient (GCG) algorithm to optimize universal adversarial triggers, which are then inserted into the HTML accessibility tree parsed by the LLM.
This research highlights critical security risks, including login credential exfiltration and forced ad clicks, emphasizing the urgent need for stronger defenses in LLM-driven autonomous web agents.

STL-GO: Spatio-Temporal Logic with Graph Operators for Distributed Systems with Multiple Network Topologies

STL-GO (Spatio-Temporal Logic with Graph Operators): introduces a novel logic for specifying and verifying complex multi-agent system requirements, featuring an outer logic (system-wide reasoning), an inner logic (agent-specific reasoning), and graph operators (quantifies agent interactions) represented by a graph operator tree (operator relation representation).
This framework extends signal temporal logic by incorporating graph operators to quantitatively reason over multiple asymmetric network topologies, enabling distributed monitoring.
The distributed monitoring algorithm allows individual agents to determine specification satisfaction using only local information, demonstrated in bike-sharing and multi-drone case studies.

FROM KICKING TO CAUSALITY: SIMULATING INFANT AGENCY DETECTION WITH A ROBUST INTRINSIC REWARD

CAIS (Causal Action Influence Score): introduces a novel, model-based intrinsic reward for robust agency detection in noisy environments, utilizing a MIMo-Mobile Environment, an Embodied Agent (MIMo) with a Visual Encoder and Agent Architecture, driven by a Reinforcement Learning Framework with an Expected SARSA Algorithm, and a Reward Module that calculates CAIS via Quantile Regression and Wasserstein Distance, alongside a Surprise Signal, Mobile Trajectory Length, and Representation Trajectory Length, all optimized by AdamW Optimizer.
The paper demonstrates that CAIS enables the agent to distinguish self-generated effects from environmental noise, leading to a robust sense of agency that generalizes to unpredictable scenarios.
The framework also successfully reproduces the "extinction burst" phenomenon by augmenting CAIS with a surprise signal, highlighting the psychological plausibility of the causal inference approach.

Search-Based Autonomous Vehicle Motion Planning Using Game Theory

N-MP (Nash Motion Planner): introduces a search-based interactive motion planning scheme for autonomous vehicles, incorporating Dynamic Equation Derivation, Objective Function Formulation, Nash Equilibrium Identification, and Ego-AV Speed Modification.
This novel approach models other road users as intelligent agents within a game-theoretic framework, generating realistic and safer paths for autonomous vehicles.
The framework demonstrates low computational time and adaptability to various vehicle dynamics and road users, making it suitable for complex traffic scenarios and real-time applications.

The Rise of AI Teammates in Software Engineering (SE) 3.0: How Autonomous Coding Agents Are Reshaping Software Engineering

SE 3.0 (Agentic Software Engineering): introduces AIDev, a large-scale dataset, to empirically study how Autonomous Coding Agents (AI teammates), Human Developers (human collaborators), Review Bots (automated code reviewers), GitHub Repositories (software project hosts), and Pull Requests (code change proposals) are reshaping software engineering.
The paper analyzes 456,535 Agentic PRs from five leading LLM-powered agents, revealing their contributions, acceptance rates, and review dynamics compared to human-authored PRs.
Key findings highlight agents' speed in code submission, lower PR acceptance rates for complex tasks, and the increasing role of review bots, underscoring the need for new SE methodologies.

AgentFly: Extensible and Scalable Reinforcement Learning for LM Agents

AgentFly: introduces a scalable and extensible Agent-RL framework, with an Agent Module (manages agent workflow) and an RL Training Module (executes reinforcement learning), designed to empower LM agents with diverse RL algorithms.
The framework supports multi-turn interactions by adapting traditional RL methods with token-level masking and features a decorator-based interface for defining tools and reward functions.
It implements asynchronous execution of tool calls and reward computations, alongside a centralized resource management system, to support high-throughput training and scalable environment coordination.

HMARL-CBF – Hierarchical Multi-Agent Reinforcement Learning with Control Barrier Functions for Safety-Critical Autonomous Systems

HMARL-CBF: introduces a novel hierarchical multi-agent reinforcement learning approach, with High-Level Policy (learns joint cooperative behavior), Low-Level Policy (learns safe individual behavior), CBF-Based Policy (executes skills safely), High-Level Policy Network (implements high-level policy), Low-Level Policy Parameter Network (implements low-level policy), Skills (predefined safety-constrained actions), Control Barrier Functions (enforce pointwise safety), Control Lyapunov Functions (guide skill execution), Extrinsic Trajectory Return (optimizes joint performance), and Intrinsic Trajectory Return (learns individual skills), designed for safe policy learning in multi-agent safety-critical autonomous systems by decomposing the problem into two levels.
The framework ensures safety guarantees during both training and real-world deployment by integrating Control Barrier Functions for pointwise-in-time safety constraints and utilizing a skill-based hierarchical structure.
The approach validates its effectiveness on challenging multi-agent traffic scenarios, demonstrating superior safety compliance and improved performance compared to existing methods.

Can Mental Imagery Improve the Thinking Capabilities of AI Systems?

The Machine Thinking Framework: introduces a comprehensive framework integrating mental imagery to enhance AI thinking capabilities, featuring a Cognitive Thinking Unit, Needs Unit, Input Data Unit, and Mental Imagery Unit.
This framework enables AI systems to reason, plan, and infer decisions autonomously by processing sensory inputs and internally generated representations.
It addresses limitations of current AI models by simulating human-like cognitive processes, bridging perception, reasoning, and imagination.

StaAgent: An Agentic Framework for Testing Static Analyzers

STAAGENT (An Agentic Framework for Testing Static Analyzers): introduces an LLM-driven agentic framework for systematically evaluating static analyzer rules, including a Seed Generation Agent (generates bug-inducing programs), a Code Validation Agent (validates seeds, generates tests), a Mutation Generation Agent (creates semantically equivalent mutants), and an Analyzer Evaluation Agent (compares analyzer behavior).
The framework leverages LLMs to synthesize, mutate, and validate code snippets, performing metamorphic testing to uncover inconsistencies in static analyzer rule implementations.
This approach offers a scalable and adaptable solution to improve the reliability of static analyzers by identifying flaws in rule implementations through inconsistent behaviors.

Integrating Reason-Based Moral Decision-Making in the Reinforcement Learning Architecture

RBAMA (Reason-Based Artificial Moral Agent): introduces an extended reinforcement learning architecture that integrates an ethics module with a reasoning unit to enable moral decision-making based on normative reasons and iterative refinement through case-based feedback from a moral judge.
The framework includes a reasoning unit operating on a learned reason-theory, moral policies for fulfilling moral goals, moral filters for enforcing moral constraints, and an instrumental policy for task achievement.
This modular design ensures behavioral conformity to inferred moral obligations, enhances moral trustworthiness and robustness, and allows for moral justification of the agent's actions.

Active Probing with Multimodal Predictions for Motion Planning

APMP (Active Probing with Multimodal Predictions): introduces a unified framework that combines trajectory planning, multimodal predictions, and active probing to enhance decision-making under uncertainty, integrating utility maximization, safety assessment, and information maximization.
The framework develops a novel risk metric that seamlessly integrates multimodal prediction uncertainties through mixture models, proving analytical tractability with a closed-form solution.
It incorporates an active probing mechanism to strategically select actions for improving estimates of other agents' behavioral parameters, demonstrating robust performance in complex traffic scenarios.

19th July 2025

Towards AI Urban Planner in the Age of GenAI, LLMs, and Agentic AI

AI Urban Planner: introduces a conceptual framework for automated urban planning, integrating a Generative Urban Planning Framework with Representation and Generation stages, LLMs, Agentic AI, Digital Twins, and a Human-Machine Co-design Interface.
The framework aims to synthesize optimal land-use configurations by encoding diverse urban contexts into structured embeddings and generating plans conditioned on geospatial, social, and human-centric constraints.
This approach seeks to augment human expertise, democratize planning insights, and enable adaptive, customizable urban design solutions.

NEO: A CONFIGURABLE MULTI-AGENT FRAMEWORK FOR SCALABLE AND REALISTIC TESTING OF LLM-BASED AGENTS

Neo: introduces a configurable multi-agent framework for scalable and realistic testing of LLM-based agents, including a Question Agent (generates test inputs), an Evaluation Agent (assesses target agent output), and a Context Hub (stores test context/history) to simulate human-like conversations and evaluate LLM systems.
The framework leverages a probabilistic state model to control dialogue flow, emotional tone, and topical intent, enabling dynamic variation across multi-turn test cases and uncovering edge cases.
Neo's architecture supports both pre-deployment testing and post-launch monitoring, aiming for self-evolution through memory-driven refinement and continuous improvement of LLM testing.

Agentic Satellite-Augmented Low-Altitude Economy and Terrestrial Networks: A Survey on Generative Approaches

Agentic AI: introduces a survey on how Agentic AI, empowered by Generative AI (GAI) models (Variational Autoencoders, Generative Adversarial Networks, Generative Diffusion Models, Transformer-Based Models) and LLMs, enhances perception, reasoning, and action capabilities within Satellite-Augmented Low-Altitude Economy and Terrestrial Networks (SLAETNs).
This approach addresses challenges in SLAETNs by enabling autonomous decision-making, resource-constrained sensing, and secure cross-domain coordination.
The survey provides a model-driven foundation, comparative analysis, and future directions for building scalable, adaptive, and trustworthy generative agents in integrated networks.

AMICO: AN EVENT-DRIVEN MODULAR FRAMEWORK FOR PERSISTENT AND EMBEDDED AUTONOMY

AMICO (An Event-Driven Modular Framework for Persistent and Embedded Autonomy): introduces an event-driven, modular agent framework designed for persistent and embedded autonomy, featuring distinct layers (Environment, Interaction, AI Agent, Engine) and core components like Event Generator, Action Selector, and integrated memory systems.
Implemented in Rust for performance and safety, AMICO supports reactive agents operating across embedded systems and browser environments via WebAssembly (WASM), enabling robust and efficient real-world deployment.
The framework provides clear abstractions for event processing, agent state management, behavior execution, and LLM-based reasoning integration, facilitating resilient, interactive, and persistent agent behavior under resource constraints.

Routine: A Structural Planning Framework for LLM Agent System in Enterprise

Routine (A Structural Planning Framework for LLM Agent System in Enterprise): introduces a multi-step agent planning framework with Planning Module (generates step-by-step plan), Execution Module (follows plan, generates tool call instructions), Tool Module (receives instructions, returns execution results), and Memory Module (stores context), where it provides a clear structure, explicit instructions, and seamless parameter passing to guide an agent's execution module in performing multi-step tool-calling tasks with high stability.
The framework significantly increases execution accuracy in model tool calls, improving performance of LLMs like GPT-4o and Qwen3-14B in real-world enterprise scenarios.
Routine also enables the distillation of domain-specific tool-usage patterns and enhances model adaptability to new scenarios, accelerating the deployment and adoption of agent systems.

When Autonomy Goes Rogue: Preparing for Risks of Multi-Agent Collusion in Social Systems

Self-Evolving Multi-Agent Collusion Framework: introduces a novel simulation framework for studying multi-agent collusion, incorporating components for agent coordination, behavior evolution, and platform-level intervention, where it simulates and analyzes how malicious agents coordinate and adapt in high-stakes environments like misinformation and e-commerce fraud.
The framework, built on the OASIS social simulator, demonstrates that decentralized malicious multi-agent systems are more effective and adaptive in spreading harm than centralized ones, even against traditional interventions.
It provides insights into malicious group operations and highlights the need for dynamic detection systems and countermeasures against evolving collusive behaviors.

LEARNING TO COMMUNICATE IN MULTI-AGENT REINFORCEMENT LEARNING FOR AUTONOMOUS CYBER DEFENCE

DIAL (Differentiable Inter-Agent Learning): introduces a multi-agent reinforcement learning framework for autonomous cyber defense, featuring blue agents with C-Nets that learn to communicate and take defensive actions within the CybORG simulation environment.
The framework enables blue agents to develop tactical policies akin to human experts, learning minimal cost communication messages while defending against cyber threats in various network configurations.
DIAL's approach, including Strategic Action Unmasking, allows agents to coordinate effectively and outperform agents requiring global state information, demonstrating practical applicability in enterprise network simulations.

18th July 2025

DPMT: Dual Process Multi-scale Theory of Mind Framework for Real-time Human-AI Collaboration

DPMT (Dual Process Multi-scale Theory of Mind Framework): introduces a novel framework for real-time human-AI collaboration, featuring an Information Extractor, a Fast System for intuitive decision-making, a Slow System with a multi-scale ToM module for cognitive reasoning, an Action Decoding Module, and a Memory component.
The framework leverages a dual-process approach, where the Fast System handles immediate macro-action decisions using a smaller LLM, while the Slow System, powered by LLMs, performs deeper, multi-scale ToM reasoning to model human partners' domain knowledge, cognitive style, and intentions.
This hierarchical design enables efficient human-AI collaboration by integrating quick decision-making with robust human partner modeling, enhancing adaptability and interpretability in complex, dynamic scenarios.

CodeEdu: A Multi-Agent Collaborative Platform for Personalized Coding Education

CodeEdu: introduces a multi-agent collaborative platform for personalized coding education, leveraging its Tool Pool (external utilities), Agent Pool (specialized LLM agents), and Task Pool (standard task types) to dynamically allocate agents and tasks for proactive and personalized learning.
The platform's workflow encompasses Personalized Material Generation, Real-Time Q&A, Step-by-step Code Tutoring with Debugging, and Learning Report Generation, facilitated by dynamic agent and task allocation.
Automated evaluations demonstrate CodeEdu's efficacy in substantially enhancing students' coding performance and providing high-quality learning materials compared to baseline LLM tutors.

AGENTS-LLM: Augmentative GENeration of Challenging Traffic Scenarios with an Agentic LLM Framework

AGENTS-LLM (Augmentative GENeration of Challenging Traffic Scenarios with an Agentic LLM Framework): introduces an LLM-agent based framework for augmenting real-world traffic scenarios using natural language descriptions, featuring a Scenario Modifier Agent, a Toolbox, and an optional Quality Assurance loop with Text QA and Visual QA agents.
This framework addresses the limitations of manual scenario augmentation by domain experts, enabling scalable generation of challenging and safety-critical driving scenarios.
The agentic design provides fine-grained control over the output and allows smaller, cost-effective LLMs to achieve performance comparable to larger models.

COGNIQ-H: A SOFT HIERARCHICAL REINFORCEMENT LEARNING PARADIGM FOR AUTOMATED DATA PREPARATION

CogniQ-H: introduces a soft hierarchical reinforcement learning paradigm for automated data preparation, synergistically fusing a Large Language Model (LLM) as a high-level planner, a Learning-to-Rank (LTR) model for immediate quality scores, and an RL Q-model for long-term value estimates, integrated by a synergistic policy layer.
This framework addresses the combinatorial search space of data preparation by providing probabilistic, LLM-driven strategic guidance, avoiding the rigid commitments of traditional hard hierarchical reinforcement learning.
The framework balances pre-existing knowledge, supervised signals, and adaptive learning to achieve robust and efficient pipeline discovery, outperforming state-of-the-art RL-based methods in pipeline quality and convergence speed.

CUDA-L1: Improving CUDA Optimization via Contrastive Reinforcement Learning

CUDA-L1 (Improving CUDA Optimization via Contrastive Reinforcement Learning): introduces an automated reinforcement learning framework for CUDA optimization, which leverages a three-stage pipeline including Supervised Fine-tuning, Self-supervised Learning, and Contrastive Reinforcement Learning to enhance optimization by distinguishing between effective and ineffective CUDA strategies through comparative analysis of generated variants and their execution performance.
The framework achieves significant speedups (average 17.7x, peak 449x on NVIDIA A100) across 250 KernelBench CUDA kernels and demonstrates strong portability across various GPU architectures.
CUDA-L1 autonomously discovers diverse optimization techniques, identifies optimal combinations, uncovers fundamental principles, and pinpoints hidden bottlenecks without human expertise, showcasing RL's potential in complex code optimization.

The Emotion-Memory Link: Do Memorability Annotations Matter for Intelligent Systems?

Conceptual Model of Emotion-Memory Link: introduces a framework investigating the relationship between perceived group emotions and group memorability in conversational interactions, including components like cognitive appraisal, experienced emotion, physiological reaction, behavior, observer annotation, memory encoding, and accessible memories.
The paper empirically examines if third-party affect annotations, commonly used in Affective Computing, reliably capture memory-relevant information in dynamic group settings.
The study concludes that the observed relationship between group affect and memorability annotations is not significantly different from random chance, questioning the utility of third-party affect annotations as proxies for conversational memorability.

Photonic Fabric Platform for AI Accelerators

PFA (Photonic Fabric Appliance): introduces a photonic-enabled switch and memory subsystem for AI accelerators, integrating Photonic Fabric Modules (PFM) with photonic and electronic components, and external DDR5 memory, to overcome memory bottlenecks and scale AI workloads.
The system provides up to 32 TB of shared memory and 115 Tbps of all-to-all digital switching, enabling more efficient distributed AI training and inference.
Evaluated using the CelestiSim simulator, PFA demonstrates significant throughput and latency improvements for LLM inference and substantial energy savings for LLM training compared to conventional GPU-based systems.

NetIntent: Leveraging Large Language Models for End-to-End Intent-Based SDN Automation

NetIntent: introduces a unified and adaptable framework that leverages LLMs and non-LLM agents to automate the entire Intent-Based Networking (IBN) lifecycle, from high-level user intents to low-level Software-Defined Networking (SDN) configurations.
The framework orchestrates LLMs for intent translation, conflict detection, and corrective actions, while non-LLM agents handle validation, resolution, deployment, and assurance tasks.
NetIntent supports dynamic re-prompting and contextual feedback, enabling robust execution of user-defined intents with minimal human intervention across OpenDaylight (ODL) and Open Network Operating System (ONOS) SDN controllers.

WebGuard: Building a Generalizable Guardrail for Web Agents

WebGuard: introduces a generalizable guardrail system for web agents, comprising the WebGuard Dataset (human-annotated action risk levels), a three-tier Risk Schema (action risk categorization), and a Guardrail Model (LLM, predicts action risk level) that processes Observation Space (webpage state input) and Action Space (proposed web agent action), integrated with Human-in-the-loop Control (user intervention mechanism) and Annotation Tools (dataset creation and labeling).
The system addresses the urgent need for effective safety measures for LLM-powered web agents by predicting the outcome of state-changing actions using a dataset of 4,939 human-annotated actions across diverse websites and domains.
Evaluations reveal that frontier LLMs struggle with action outcome prediction and high-risk recall, emphasizing the necessity of dedicated safeguards and specialized fine-tuning for reliable web agent deployment.

DREAMS: Density Functional Theory Based Research Engine for Agentic Materials Simulation

DREAMS (Density Functional Theory Based Research Engine for Agentic Materials Screening): introduces a hierarchical, multi-agent framework for DFT simulation, featuring a Supervisor LLM Agent (Generates/Updates Plans/Assigns Tasks), DFT LLM Agent (Manages DFT Calculations/Structure Generation/Parameter Optimization/Output Analysis), Convergence LLM Agent (Suggests Fixes/Resolves Convergence Issues), and HPC LLM Agent (Allocates Resources/Submits/Monitors Jobs), all interacting via a Canvas (Shared Information Dashboard/Context Preservation) and utilizing an HPC Cluster (High-Performance Computing Environment).
This framework automates high-fidelity Density Functional Theory simulations, addressing challenges like parameter fine-tuning and systematic error handling, thereby reducing human intervention.
DREAMS achieves L3-level automation in materials discovery, demonstrating expert-level accuracy in lattice constant calculations and complex problem-solving for adsorption puzzles.

ADAPTIVE MULTI-AGENT REASONING VIA AUTOMATED WORKFLOW GENERATION

Nexus Architect: introduces an enhanced multi-agent system framework that autonomously generates and refines reasoning workflows from user prompts and examples, integrating User Prompt, Examples, Nexus Documentation, Task Decomposition & Planning, Reasoning Workflow Design, Supervisor Builder, Agent Builder, Tool Builder, Workflow Validation & Testing, Performance Assessment, Feedback, Iterative Prompt Refinement (IPR), Prompt Engineering, Validated Reasoning Graph, and Nexus Runtime Environment.
This framework systematically decomposes complex inferential reasoning tasks, instantiates multi-agent architectures, and iteratively tunes agent system prompts to maximize performance and improve generalization capabilities using standard, non-reasoning LLMs.
The framework leverages a feedback-driven prompt engineering mechanism to achieve automated reasoning, enabling robust and generalizable problem-solving without requiring specialized LLM training or fine-tuning.

COGNIQ-H: A SOFT HIERARCHICAL REINFORCEMENT LEARNING PARADIGM FOR AUTOMATED DATA PREPARATION

CogniQ-H: introduces a soft hierarchical reinforcement learning paradigm for automated data preparation, with its Macro-Stage Layer (high-level planning) where an LLM (strategic prior generation) generates a strategic prior, a Micro-Stage Layer (evidence provision) where an LTR Model (immediate quality scoring) provides immediate quality scores and an RL Q-Model (long-term reward estimation) estimates long-term rewards, and a Synergistic Policy Layer (action integration) where a Synergistic Policy (final action selection) integrates these signals for action selection.
This framework formulates action selection as a Bayesian inference problem, allowing it to balance high-level strategic guidance from the LLM with adaptive, evidence-based decision-making from the LTR model and RL Q-model.
The approach achieves improved pipeline quality and faster convergence by avoiding the rigid commitments of traditional hard hierarchical RL, enabling robust and efficient discovery of optimal data preparation pipelines.

17th July 2025

A Survey of Context Engineering for Large Language Models

Context Engineering: introduces a formal discipline for systematic optimization of information payloads for LLMs, with foundational components for context retrieval, processing, and management, and system implementations including RAG, memory systems, tool-integrated reasoning, and multi-agent systems.
The paper provides a comprehensive taxonomy classifying techniques into foundational components for context generation, processing, and management, and sophisticated system implementations for real-world applications.
The survey identifies a critical research gap where LLMs excel at understanding complex contexts but show limitations in generating equally sophisticated, long-form outputs, highlighting a key priority for future research.

Change of Thought: Adaptive Test-Time Computation

SELF-Transformer: introduces a novel architecture that augments self-attention with Fixed-Point Iteration (FPI) to enable latent alignment refinement, where it iteratively updates attention weights to a fixed point, scaling test-time computation with input difficulty.
This framework achieves deeper contextual reasoning without additional parameters by leveraging FPI universally across all layers, improving latent representations without token-level autoregression.
The approach employs dynamic parameter reuse and implicit differentiation for efficient gradient computation, ensuring scalability and stability while adapting to input complexity.

Prompt Injection 2.0: Hybrid AI Threats

Layered Defense Architecture: introduces a robust defense against hybrid AI threats, combining Preamble's trusted/untrusted classification, CaMeL's architectural isolation, Spotlighting, and traditional controls.
This architecture addresses prompt injection attacks by distinguishing trusted instructions from untrusted inputs, isolating control and data flows, explicitly marking untrusted content, and leveraging existing security measures.
The paper details how these components work together to provide a scalable and comprehensive defense posture for LLM-integrated systems in complex, real-world environments.

SE-VLN: A Self-Evolving Vision-Language Navigation Framework Based on Multimodal Large Language Models

SE-VLN (Self-Evolving Vision-Language Navigation Framework): introduces a training-free VLN framework driven by MLLMs, encompassing a hierarchical memory module, a retrieval-augmented thought-based reasoning module, and a reflection module, where it endows VLN agents with the ability to continuously evolve during testing by simulating natural agent evolution processes.
The hierarchical memory module, comprising an experience repository and a verbal topological map, enables the agent to retrieve contextual memory and past similar experiences, crucial for enhancing navigation performance.
The framework's reflection module, with its outcome evaluator and experience corrector, facilitates continuous learning by analyzing task evaluation results and updating the experience repository with corrected decisions, promoting self-evolution.

RIDAS: A Multi-Agent Framework for AI-RAN with Representation- and Intention-Driven Agents

RIDAS: introduces a multi-agent framework for AI-RAN that unifies low-level representation control with high-level intent interpretation via its RDA and IDA components, where RDAs encode messages and control quality/rate, and IDA maps user intents to RDA configurations and manages resource allocation using an LLM, memory, and a two-stage planning pathway.
The framework addresses the gap between high-level user intents and low-level, parameterized configurations required for optimal AI-RAN performance by enabling efficient bandwidth allocation and QoS satisfaction.
RIDAS dynamically adjusts control parameters based on network conditions and user QoS requirements, achieving near-optimal performance in transmission rate and task performance demands.

Intelligent Virtual Sonographer (IVS): Enhancing Physician-Robot-Patient Communication

IVS (Intelligent Virtual Sonographer): introduces a dual-LLM-driven embodied conversational agent that facilitates real-time, multidirectional communication between physicians, a robotic ultrasound system, and patients in an Extended Reality environment.
The system enhances efficiency, clarity, and accessibility of robotic ultrasound acquisition by translating physician commands into robotic actions and relaying system updates and empathetic explanations to patients.
It leverages two independent LLM instances for parallel physician- and patient-facing dialogues, integrating speech-to-text, text-to-speech, and robotic control for seamless interaction.

MAD-SPEAR: A Conformity-Driven Prompt Injection Attack on Multi-Agent Debate Systems

MAD-SPEAR: introduces a conformity-driven prompt injection attack, with Attacker, Injected Data, Targeted Agents, Sybil Agent Simulation, Conformity Exploitation, Misinformation Propagation, Confidence Level Manipulation, Output Format Replication, and Communication Attack Integration, designed to compromise Multi-Agent Debate (MAD) systems by manipulating a small subset of LLM agents to propagate misinformation and degrade consensus quality.
The attack exploits LLMs' inherent conformity tendencies and can be combined with communication attacks to amplify its impact, significantly impairing task-solving accuracy and scalability.
The paper also proposes a formal definition of MAD fault-tolerance and a comprehensive evaluation framework, highlighting the urgent need for improved security in MAD system designs.

MCPEval: Automatic MCP-based Deep Evaluation for AI Agent Models

MCPEval (Model Context Protocol-based framework): introduces an open-source framework that automates end-to-end task generation and deep evaluation of LLM agents across diverse domains, including MCP Server, MCP Client (Agent), Task-LLM, LLM Judger, Tool Call Evaluation, Ground Truth Trajectory, and Auto Report Generation.
The framework standardizes metrics, seamlessly integrates with native agent tools, and eliminates manual effort in building evaluation pipelines, providing actionable feedback for optimizing LLM agent implementations.
MCPEval's automated workflow includes task generation, verification, and model evaluation, leveraging synthetic data and iterative refinement to ensure high-quality tasks and comprehensive analysis of agent behavior.

A Systematic Survey of Electronic Health Record Modeling: From Deep Learning Approaches to Large Language Models

Unified Taxonomy for EHR Modeling: introduces a comprehensive survey of Electronic Health Record (EHR) modeling, categorizing methods across data-centric approaches, neural architecture design, learning-focused strategies, multimodal learning, and LLM-based modeling systems, where it provides a structured roadmap for advancing AI-driven EHR modeling and clinical decision support.
This survey systematically organizes recent advancements in deep learning and LLMs for EHRs, highlighting emerging trends like foundation models and LLM-driven clinical agents.
It discusses open challenges in benchmarking, explainability, clinical alignment, and generalization across diverse clinical settings, aiming to promote reproducibility and accessibility for new researchers.

Humans learn to prefer trustworthy AI over human partners

Partner Selection Game: introduces a communication-based partner selection game in a triadic setting where human selectors choose between human and LLM-powered bot candidates, examining partner selection dynamics and human adaptation under AI competition.
The framework utilizes LLMs (specifically OpenAI's GPT-4o) to simulate bot candidates, and employs computational models like the Rescorla-Wagner algorithm to analyze human selectors' belief updating and decision-making.
The study investigates the impact of identity transparency on partner selection, showing how it influences human learning about bot and human behavior and affects competitive outcomes in hybrid human-AI societies.

GraphTrafficGPT: Enhancing Traffic Management through Graph-Based AI Agent Coordination

GraphTrafficGPT: introduces a novel graph-based architecture that fundamentally redesigns task coordination for LLM-driven traffic applications, utilizing an Input Processing Module (decomposes user queries), Dependency Graph Generator (builds task graph), Brain Agent (central task coordinator), Specialized Agents (domain-specific task handlers), Multi-Agent Communication Protocol (MCP) (agent communication, synchronization), Tool Box (traffic foundation models), and Response Integration Module (combines agent outputs) to enable efficient parallel execution and dynamic resource allocation.
The system represents tasks and their dependencies as nodes and edges in a directed graph, allowing for concurrent multi-query processing and significant reductions in token consumption and response latency compared to chain-based approaches.
This architecture enhances scalability and efficiency for complex, real-world traffic management scenarios by orchestrating a network of specialized agents for data retrieval, analysis, visualization, and simulation.

Apple Intelligence Foundation Language Models Tech Report 2025

AFM (Apple Foundation Models): introduces two multilingual, multimodal foundation language models, an On-Device Model (compact LLM) and a Server Model (scalable LLM), detailing their architecture including KV Cache Sharing (on-device memory optimization) and Parallel Track Mixture-of-Experts (PT-MoE) (server sparse architecture), multimodal capabilities via a Vision Encoder (visual feature extraction), training methodologies like Supervised Fine-Tuning (SFT) (model refinement) and Reinforcement Learning from Human Feedback (RLHF) (alignment training), inference optimizations such as Quantization Aware Training (QAT) (on-device compression), Adaptive Scalable Texture Compression (ASTC) (server compression), and Low-Rank Adaptation (LoRA) Adapters (quality recovery), all integrated within a Foundation Models Framework (developer access) offering Guided Generation (constrained output), Tool Calling (external tool integration), and LanguageModelSession (context management), while adhering to Responsible AI principles (ethical guidelines).
The paper highlights architectural innovations like PT-MoE and KV-cache sharing for efficiency, alongside comprehensive data pipelines and advanced fine-tuning techniques to enhance model capabilities and privacy.
The models support multilingual and multimodal inputs, improve tool-use and reasoning, and are accessible to developers via a Swift-centric framework for integrating generative AI features into Apple applications.

Change of Thought: Adaptive Test-Time Computation

SELF-Transformer: introduces a novel architecture that augments self-attention with Fixed-Point Iteration (FPI) to enable latent alignment refinement, where it iteratively updates attention weights to a fixed point, scaling test-time computation with input difficulty.
This framework achieves deeper contextual reasoning without additional parameters by leveraging FPI universally across all layers, improving latent representations without token-level autoregression.
The approach employs dynamic parameter reuse and implicit differentiation for efficient gradient computation, ensuring scalability and stability while adapting to input complexity.

Model-free Reinforcement Learning for Model-based Control: Towards Safe, Interpretable and Sample-efficient Agents

Model-free Reinforcement Learning for Model-based Control: introduces a paradigm for developing safe, interpretable, and sample-efficient agents by adapting model-based agents, which include an internal model, a planning module, and a policy/Q-function, using model-free RL algorithms.
This approach leverages prior system knowledge embedded in the internal models to enhance sample efficiency and interpretability, with model-free RL addressing potential model inaccuracies.
The paper categorizes policy learning methods for these agents into derivative-free (e.g., Bayesian Optimization) and gradient-based (e.g., Policy Search RL) approaches, highlighting their distinct advantages and challenges.

iReDev: A Knowledge-Driven Multi-Agent Framework for Intelligent Requirements Development

iREDEV (Knowledge-Driven Multi-Agent Framework for Intelligent Requirements Development): introduces a knowledge-driven multi-agent framework for intelligent requirements development, with six knowledge-driven agents, an artifact pool, and a human-in-the-loop mechanism, designed to automate and enhance the software requirements development process.
The framework integrates human expert knowledge into agent design and utilizes an event-driven communication mechanism via a shared artifact pool to support dynamic and collaborative requirements development tasks.
The system employs LLMs as underlying intelligence for its agents and incorporates a human-in-the-loop mechanism to ensure generated artifacts align with stakeholder expectations and improve reliability.

Non-differentiable Reward Optimization for Diffusion-based Autonomous Motion Planning

Non-differentiable Reward Optimization for Diffusion-based Autonomous Motion Planning: introduces a reinforcement learning-based training scheme that optimizes diffusion motion planning models using non-differentiable objectives like collision and goal achievement, facilitated by a dynamic thresholding algorithm to shape dense reward signals.
This approach enables direct optimization of critical autonomy objectives, outperforming models trained with differentiable objectives on pedestrian datasets.
The method addresses sparse reward problems in autonomous motion planning by adaptively adjusting reward sparsity, ensuring stable learning and improved performance.

MACHINE-READABLE ADS: ACCESSIBILITY AND BEHAVIORAL PATTERNS OF AI WEB AGENTS INTERACTING WITH ONLINE ADVERTISEMENTS

AI Web Agent Advertising Interaction Evaluation: introduces a controlled experimental setup to assess the accessibility and behavioral patterns of AI web agents interacting with online advertisements, utilizing the Browser Use framework, GPT-4o, Claude 3.7 Sonnet, Gemini 2.0 Flash, OpenAI Operator, a faithful TT.com clone, and an experimental protocol.
The evaluation framework investigates how various LLM-powered agents perceive, interact with, and responsibly behave in an ad-heavy online environment, considering semantic markup, dynamic content, and ethical implications.
Key findings reveal agents' satisficing behavior, their preference for explicit DOM elements over purely visual cues, and model-specific risk profiles concerning financial commitments and consent handling.

Intent-Based Network for RAN Management with Large Language Models

IBNS (Intent-Based Network System): introduces a novel automation approach for RAN management, integrating LLMs within an agentic architecture that includes a Strategist Agent (translates intent, generates strategy), History Analyzer Agent (analyzes past strategies, provides insights), and interacts with a Radio Access Network (simulated network environment) via O-RAN O1 interface (standard RAN communication).
The system leverages a structured prompt engineering technique for LLM-driven intent translation, dynamically optimizing RAN parameters for energy efficiency through a closed-loop mechanism.
This approach enables robust resource management by adapting strategies based on real-time feedback, showcasing the potential of LLM-orchestrated agentic systems for autonomous network operation.

Public Evaluation on Potential Social Impacts of Fully Autonomous Cybernetic Avatars for Physical Support in Daily-Life Environments: Large-Scale Demonstration and Survey at Avatar Land

The demonstrated system: introduces a framework for fully autonomous Cybernetic Avatars (CAs) to provide physical support in daily-life environments, integrating User Instruction Input, Speech Recognition (Whisper), Posture Detection (MediaPipe), Exophora Resolution Model, LLM (GPT-4o), Multi-Robot Planning, Fetch Robotics Fetch, Preferred Robotics Kachaka, Containerized SDE, ROS, and Extended Reality Visualization (Meta Quest 3 XR headset) to enable object retrieval tasks.
This system was publicly evaluated at Avatar Land, assessing user perceptions and social impacts of autonomous CAs performing daily object retrieval in a replicated home environment.
The evaluation revealed public interest in CAs for daily support but highlighted concerns regarding task execution reliability, emphasizing the need for improved robot performance.

LightAutoDS-Tab: Multi-AutoML Agentic System for Tabular Data

LightAutoDS-Tab: introduces a multi-AutoML agentic system for tabular data, combining LLM-based code generation with multiple AutoML tools, and includes interactor, planner, generator, validator, improver, AutoML, executor, interpreter, and result aggregation components.
This framework enhances existing AutoML tools by integrating them with an LLM agent for flexible, data-aware code generation and configuration of ML pipelines, addressing limitations of fixed-pipeline and LLM-only approaches.
It streamlines end-to-end ML pipeline development, offering increased automation, reduced development time, and improved interpretability and quality for data science tasks.

16th July 2025

AIME: TOWARDS FULLY-AUTONOMOUS MULTI-AGENT FRAMEWORK

Aime: introduces a novel multi-agent framework with a Dynamic Planner (orchestrates tasks), Actor Factory (instantiates actors), Dynamic Actor (executes subtasks), and Progress Management Module (manages state), designed for dynamic, reactive planning and execution.
The framework replaces conventional static workflows with a fluid, adaptive architecture, continuously refining strategy based on real-time execution feedback and enabling on-demand agent specialization.
Aime addresses critical limitations of rigid plan execution, static agent capabilities, and inefficient communication in multi-agent systems, establishing a more resilient and effective foundation for collaboration.

Advancing Retrieval-Augmented Generation for Structured Enterprise and Internal Data

Advanced RAG Framework: introduces an advanced Retrieval-Augmented Generation framework designed to effectively retrieve and generate responses from heterogeneous enterprise data, including text, structured documents, and tabular records, by combining optimized document preprocessing, hybrid retrieval strategies, advanced ranking mechanisms, and feedback-driven refinement.
The framework employs semantic and table-aware chunking, hybrid retrieval (dense embeddings and BM25), metadata-driven filtering, and cross-encoder reranking to enhance relevance and contextual alignment.
It further integrates interactive query refinement using LLMs and a human-in-the-loop feedback mechanism with conversational memory to improve system adaptability and response quality over time.

Beyond Single Models: Enhancing LLM Detection of Ambiguity in Requests through Debate

Multi-Agent Debate Framework: introduces a multi-agent debate framework, with User Request (Input instruction), Multi-Agent Debate Framework (Core system), LLM Agents (Processing units), Leader Agent (Proposes initial solution), Follower Agents (Evaluate proposals), Debate Rounds (Iterative process), Consensus Mechanism (Decision-making), and Clarification Question Generation (Output question), designed to enhance LLM detection and resolution of ambiguity in user requests through structured debate.
This framework employs multiple LLM agents (Llama3-8B, Gemma2-9B, Mistral-7B) in a leader-follower protocol to collaboratively analyze ambiguous instructions and generate clarifying questions.
The debate mechanism improves ambiguity detection and resolution, particularly for complex ambiguities, by leveraging diverse perspectives and iterative refinement, though its utility is model-dependent.

GitChameleon: Evaluating AI Code Generation Against Python Library Version Incompatibilities

GitChameleon: introduces, "a novel Python-based benchmark", with all Inputs (problem context, requirements), Candidate Solution Generation (LLM/AI agent), Candidate Solution (generated code), Validation (executes tests), Hidden Tests (success evaluation), Visible Tests (self-debugging feedback), Self-Debug (iterative refinement), and Benchmark Success (evaluation outcome), where GitChameleon provides an execution-based benchmark for evaluating AI code generation against Python library version incompatibilities.
It comprises 328 Python code completion problems, each conditioned on specific library versions and accompanied by executable unit tests.
The benchmark rigorously evaluates LLMs, LLM-powered agents, code assistants, and RAG systems for version-conditioned code generation, highlighting limitations in handling library versioning.

Next-Gen Museum Guides: Autonomous Navigation and Visitor Interaction with an Agentic Robot

Alter-Ego (Autonomous Museum Guide Robot): introduces an autonomous museum guide robot that integrates LLM-powered dialogue with advanced navigation capabilities, enabling real-time, context-aware Q&A and seamless navigation.
The system leverages components like ROS Hector SLAM, YOLOv10-n, Google's Speech-to-Text API, and OpenAI's GPT-4o mini for robust operation.
It dynamically adapts tours based on user requests and location, enhancing visitor engagement and knowledge acquisition in cultural settings.

Infherno: End-to-end Agent-based FHIR Resource Synthesis from Free-form Clinical Notes

Infherno: introduces an end-to-end agent-based framework for FHIR resource synthesis, with LLM Agent (Core processing unit), Prompt Structure (Guides agent behavior), Code Search (Queries external terminologies), Python Executor (Executes generated code), fhir.resources Python Module (Ensures FHIR compliance), Code Loop (Iterative refinement process), FHIR Bundle (Aggregates output resources), and Front End (User interface), where it transforms unstructured clinical notes into structured, semantically accurate FHIR representations.
The framework leverages LLM agents with tool-use capabilities and code execution to address challenges in generalizability and structural conformity in clinical data extraction.
It supports clinical data integration and interoperability by adhering to the FHIR document schema and performing well against human baselines in predicting FHIR resources.

Evaluating the Ability of Large Language Models to Reason about Cardinal Directions, Revisited

LLM Cardinal Direction Reasoning Evaluation: introduces a comprehensive methodology for assessing LLMs' spatial reasoning, encompassing automated benchmark dataset generation (creates diverse questions), an LLM testing system (executes queries), and a performance analysis module (interprets results).
The benchmark includes 5760 questions derived from six templates, varying locomotion types, person forms, and cardinal/intercardinal directions to rigorously test LLM robustness.
The evaluation reveals that even state-of-the-art LLMs struggle with reliable cardinal direction reasoning, particularly with intercardinal directions and generalisation across different question parameters.

Value-Based Large Language Model Agent Simulation for Mutual Evaluation of Trust and Interpersonal Closeness

LLM Agent Simulation Framework: introduces a two-stage experimental workflow to investigate value similarity's influence on trust and interpersonal closeness between LLM agents, including value controllability assessment and mutual evaluation.
The framework first assesses LLM value controllability using prompts and PVQ, then simulates dialogues between value-assigned LLM agents, evaluating their mutual trust and interpersonal closeness.
This simulation demonstrates that higher value similarity leads to greater mutual trust and closeness, validating social science theories within an artificial society.

Graph Representations for Reading Comprehension Analysis using Large Language Model and Eye-Tracking biomarker

Graph Representations for Reading Comprehension Analysis Framework: introduces a method that leverages LLM-generated graph representations and eye-tracking biomarkers to analyze reading comprehension, comparing human and LLM understanding of text.
The framework converts sentences into knowledge graphs with nodes representing entities and edges representing relationships, then uses LLMs to label important graph components.
It integrates human eye-tracking data and graph-theoretic metrics to validate LLM-derived importance labels, offering insights into cognitive processes.

Extremal Testing for Network Software using LLMs

Extremal Testing Methodology: introduces a novel approach for automating extremal testing of network software, leveraging LLMs for constraint generation and invalid test case creation, followed by execution on target software and differential testing for bug identification.
This two-step, chain-of-thought prompting strategy, where LLMs first define validity constraints and then generate violating tests, proves more effective than one-stage prompting.
The methodology successfully uncovered new bugs in DNS, HTTP, and BGP implementations, demonstrating its utility as a complement to existing software testing techniques like symbolic execution and fuzz testing.

THE EVOLVING ROLE OF LARGE LANGUAGE MODELS IN SCIENTIFIC INNOVATION: EVALUATOR, COLLABORATOR, AND SCIENTIST

Pyramidal Framework: introduces a comprehensive taxonomy for LLM roles in scientific innovation, encompassing Evaluator (low-autonomy knowledge synthesizer), Collaborator (mid-autonomy ideation engine), and Scientist (high-autonomy discovery platform) components.
This framework distinguishes LLMs' contributions to structured scientific research and open-ended scientific discovery, clarifying capability boundaries, evaluation criteria, and human-AI interaction patterns at each level.
The framework provides conceptual clarity, practical guidance, and theoretical foundations for future research in increasingly autonomous AI-driven science.

NLI4VolVis: Natural Language Interaction for Volume Visualization via LLM Multi-Agents and Editable 3D Gaussian Splatting

NLI4VolVis: introduces an interactive system that enables users to explore, query, and edit volumetric scenes using natural language, integrating multi-view semantic segmentation, vision-language models, editable 3D Gaussian Splatting (iVR-GS), a multi-agent LLM architecture with core and function-calling agents, memory, function-calling tools for querying, editing, question answering, view selection, and 2D stylization, a Visualization-Perception-Action (VPA) loop, and an interactive user interface, where it allows intuitive exploration and editing of volumetric datasets through open-vocabulary querying, real-time scene editing, and stylization.
The system leverages LLM multi-agents equipped with extensive function-calling tools to interpret user intents and execute visualization tasks, enhancing accessibility and usability in volumetric data exploration.
NLI4VolVis unifies editable volumetric representations, open-vocabulary scene understanding, and collaborative multi-agent LLMs to support intuitive, natural language-based volume visualization.

Topology Enhanced MARL for Multi-Vehicle Cooperative Decision-Making of CAVs

TPE-MARL (Topology Enhanced Multi-Agent Reinforcement Learning): introduces a novel multi-agent reinforcement learning method for cooperative decision-making of Connected and Autonomous Vehicles (CAVs) by designing a game topology tensor and integrating it into a QMIX-based framework.
The framework leverages a TopologyNet to compress high-dimensional traffic state information into a structured game topology tensor, enhancing learning efficiency and coordination performance in complex vehicular scenarios.
It incorporates visit counts and agent mutual information into the reward function, enabling a balance between exploration and exploitation for improved traffic efficiency, safety, and decision smoothness.

Foresight in Motion: Reinforcing Trajectory Prediction with Reward Heuristics

FiM (Foresight in Motion): introduces a "First Reasoning, Then Forecasting" strategy for trajectory prediction by integrating a reward-driven intention reasoner and a hierarchical DETR-like decoder.
The framework employs a query-centric Inverse Reinforcement Learning (QIRL) module to infer reward distributions and perform policy rollouts, providing intention-informed priors for trajectory generation.
It further utilizes a Bi-Mamba-enhanced decoder to capture sequential dependencies and an auxiliary Occupancy Grid Map (OGM) prediction head to improve feature fusion and prediction confidence.

Understanding visual attention beehind bee-inspired UAV navigation

PPO (Proximal Policy Optimization): introduces a Deep Reinforcement Learning framework for bee-inspired UAV navigation, utilizing a Policy Network composed of a CNN, MaxPool layers, a Flatten layer, Linear layers, and an Output layer to process optic flow observations for obstacle avoidance.
The framework trains agents in an AirSim simulation environment to navigate cluttered tunnels using only optic flow as sensory input, aiming to replicate honeybee navigation behaviors.
Explainable AI methods, specifically SHAP, are employed to analyze the attention patterns of trained agents, revealing that they focus on optic flow discontinuities and high-magnitude regions for decision-making.

IANN-MPPI: Interaction-Aware Neural Network-Enhanced Model Predictive Path Integral Approach for Autonomous Driving

IANN-MPPI (Interaction-Aware Neural Network-Enhanced Model Predictive Path Integral): introduces a real-time, fully parallelizable, interaction-aware trajectory planning framework that integrates an MPPI Controller, a Neural Network Predictor, and a Spline-based Prior, enabling complex maneuvers by predicting surrounding agent reactions to sampled control sequences.
The framework leverages the Neural Network Predictor to simulate diverse interaction outcomes based on ego vehicle candidate trajectories, while the Spline-based Prior enhances MPPI's sampling diversity for efficient lane-changing.
This approach addresses challenges in dense traffic by enabling proactive nudging of surrounding vehicles and achieving successful merging maneuvers, demonstrating improved efficiency and safety compared to non-interactive baselines.

15th July 2025

How Many Instructions Can LLMs Follow At Once?

IFScale: introduces a benchmark to evaluate LLM instruction-following performance degradation as instruction density increases, with Term Vocabulary Construction (Builds keyword set), Prompt Construction (Generates model input), Retry Logic (Manages generation failures), Evaluation Module (Assesses instruction adherence), and Coherence Check (04-mini) (Judges report quality), where it measures how LLMs adhere to a growing number of keyword-inclusion instructions in business report generation.
The benchmark evaluates 20 state-of-the-art LLMs, revealing distinct performance degradation patterns, primacy effects, and error types under high cognitive load.
Insights from the evaluation inform the design of instruction-dense prompts and highlight performance-latency tradeoffs for real-world LLM applications.

DrafterBench: Benchmarking Large Language Models for Tasks Automation in Civil Engineering

DrafterBench: introduces a comprehensive benchmark for evaluating LLM agents in civil engineering drawing revision, including Task Collection (summarizes real-world tasks), Tool Preparation (customizes functions/tools), Default Prompt (provides prompt framework), Evaluation Metric (assesses performance), and Dual Tools/Functions (records operation paths).
The benchmark comprises 1920 tasks across 12 types, derived from real-world drawing files, designed to assess LLM capabilities in structured data understanding, function execution, instruction following, and critical reasoning.
It utilizes dual tools to record ground operation paths for accurate performance grading and error analysis, providing insights for integrating LLMs into engineering applications.

AirLLM: Diffusion Policy-based Adaptive LoRA for Remote Fine-Tuning of LLM over the Air

AirLLM: introduces a hierarchical diffusion policy framework for communication-aware LoRA adaptation, including Cloud LLM for fine-tuning, Edge LLM for inference, a Wireless Channel for parameter transmission, an Environment providing state, reward, and action, and a Hybrid Policy with PPO for coarse policy generation and Diffusion Policy for fine-grained refinement.
The framework models rank configuration as a structured action vector, using a Proximal Policy Optimization (PPO) agent for coarse-grained decisions and Denoising Diffusion Implicit Models (DDIM) for high-resolution rank vector refinement.
It aims to balance LLM fine-tuning performance with transmission costs by adaptively optimizing LoRA rank assignments based on wireless states and linguistic complexity.

Dr.Copilot: A Multi-Agent Prompt Optimized Assistant for Improving Patient-Doctor Communication in Romanian

Dr.Copilot (Multi-Agent Large Language Model System): introduces a multi-agent LLM system designed to enhance doctor-patient communication quality in Romanian text-based telemedicine, including a Scorer Agent (evaluates responses), a Recommendation Agent (generates suggestions), and a Reconciliation Agent (simulates improvements).
The system leverages DSPy (prompt optimization) for automatic prompt optimization and utilizes open-weight LLMs (underlying models) served by VLLM (model serving), providing real-time feedback to doctors.
Dr.Copilot focuses on improving presentation quality rather than medical correctness, aiming to increase patient satisfaction and represents an early real-world deployment of LLMs in Romanian medical settings.

Taming Uncertainty via Automation: Observing, Analyzing, and Optimizing Agentic AI Systems

AgentOps (AI AgentOps Automation Pipeline): introduces a comprehensive framework for observing, analyzing, optimizing, and automating agentic AI systems, encompassing behavior observation, metric collection, issue detection, root cause analysis, optimized recommendations, and runtime automation.
The framework addresses challenges for developers, testers, SREs, and business users by taming uncertainty in LLM-powered agentic systems through automation and self-improvement.
It provides a structured approach to manage dynamic, unpredictable agent behavior, ensuring safe, adaptive, and effective operation in enterprise contexts.

An Empirical Study of Multi-Agent RAG for Real-World University Admissions Counseling

MARAUS (Multi-Agent and Retrieval-Augmented University Admission System): introduces a real-world conversational AI platform for university admissions counseling, integrating a Multi-agent Coordinator (Classifies queries), Preprocessing Module (Cleans, normalizes data), Hybrid Retrieval Module (Combines semantic, keyword search), Logic Calculation Module (Performs domain-specific computations), Factual Database (Stores structured data), LLM-based Generation (Generates responses), and Post-processing Module (Formats, mitigates hallucination).
The system employs specialized agents for information search, score calculation, recommendation, and general queries, leveraging hybrid RAG with semantic and keyword retrieval, re-ranking, and LLM-based generation to enhance accuracy and reduce hallucinations.
Deployed in a real-world university setting, MARAUS processed over 6,000 user interactions, demonstrating significant improvements in accuracy and response times while operating cost-effectively.

An Agentic Flow for Finite State Machine Extraction using Prompt Chaining

FlowFSM (An Agentic Flow for Finite State Machine Extraction using Prompt Chaining): introduces an agentic framework for FSM extraction from RFC documents, utilizing an RFC Documents Processing Pipeline, FSM Extraction using Prompt Chaining, AI Agents (CrewAI), an LLM Model, and a Rulebook.
The framework systematically processes protocol specifications, identifies state transitions, and constructs structured rule-books by chaining agent outputs.
This approach decomposes complex FSM extraction into modular, interpretable steps, enhancing transparency and robustness.

Temperature and Persona Shape LLM Agent Consensus With Minimal Accuracy Gains in Qualitative Coding

Multi-Agent System (MAS): introduces a multi-agent framework for LLM-based deductive coding, including a Single-Agent Coding Module (individual annotation simulation), Dual-Agent Discussion Module (inter-agent discussion simulation), Consensus Agent Module (disagreement resolution, final coding), LLM Agents (perform coding tasks), Codebook (structured coding categories), Ollama API (LLM interaction interface), System Prompts (agent instruction, personality injection), and Post-processing Procedure (extracts, validates code annotations), to investigate how agent persona and temperature influence consensus and coding accuracy.
The MAS emulates human qualitative coding workflows through structured agent discussions and consensus arbitration, evaluating six open-source LLMs with varying parameters and 18 experimental configurations.
The study found that while temperature robustly delays consensus, and persona congruency has selective effects, MAS deliberation generally yields minimal accuracy gains over single-agent coding, except for specific conditions.

SWE-MERA: A Dynamic Benchmark for Agenticly Evaluating Large Language Models on Software Engineering Tasks

SWE-MERA: introduces a dynamic benchmark for agenticly evaluating LLMs on software engineering tasks, utilizing a seven-stage pipeline including Repository Selection (selects GitHub repositories), PR-Issue Mapping Construction (maps pull requests to issues), Metadata Extraction and Filtering (downloads and filters metadata), Patch Extraction and Validation (generates and validates git diffs), Repository Build Validation (builds environment, runs tests), End-to-End Task Execution (executes tasks in Docker), and LLM-based Pipeline Evaluation (assesses task quality).
The framework also integrates an Aider coding agent (automates scoring), a dynamic user leaderboard (displays evaluation results), Docker containers (provides controlled environment), the GitHub GraphQL API (collects data), the Hugging Face platform (hosts dataset), and an evaluation repository (receives submissions).
SWE-MERA addresses data contamination and benchmark saturation by continuously updating its dataset with new, unseen issues, ensuring real-world relevance and fair evaluation for LLMs in software development.

DS@GT at eRisk 2025: From prompts to predictions, benchmarking early depression detection with conversational agent based assessments and temporal attention models

Voting Classifier: introduces a system for early depression detection, with Raw Data, Pre-processing Pipeline, Feature Engineering, Feature Matrix, Voting Classifier, Random Forest Classifier, Stochastic Gradient Descent Classifier, and Gradient Boosting Classifier, where it combines diverse engineered features and multiple machine learning models for classification.
This approach processes raw JSON user data through a comprehensive pre-processing pipeline to create a feature matrix, which is then fed into an ensemble of base models.
The Voting Classifier employs a soft voting strategy to aggregate predictions from its base models, aiming for robust depression detection.

General Modular Harness for LLM Agents in Multi-Turn Gaming Environments

General Modular Harness: introduces a modular design for LLM agents, with Perception Module (processes UI inputs), Memory Module (stores trajectories, reflects), Reasoning Module (integrates info, decides), and Adapter (interfaces with game), enabling a single LLM/VLM backbone to tackle diverse multi-turn gaming environments.
This harness provides a unified workflow for analyzing how each module affects performance across dynamic interactive settings in games.
Extensive experiments demonstrate consistent performance gains and reveal distinct module contributions, advancing general-purpose agent design.

MR-LDM - The Merge-Reactive Longitudinal Decision Model: Game Theoretic Human Decision Modeling for Interactive Sim Agents

MR-LDM (Merge-Reactive Longitudinal Decision Model): introduces a game-theoretic framework that models merging and lag vehicle interactions using a game-theoretic formulation with defined action sets for both actors, payoff functions incorporating a usmht function, Predictive Time Headway (PTH) metric, and Ramp End Influence Terms, alongside bounded rationality (QRE), a decision window, and an MR-IDM dynamics model.
This model explicitly generates discrete, decision-level behaviors for the lag actor, including yield behind, yield ahead, block, and do nothing, which are then executed using MR-IDM dynamics.
The framework enhances behavioral realism and controllability in traffic actor models, supporting robust evaluation of merging trajectory planners in interactive traffic scenarios.

VISTA: Monocular Segmentation-Based Mapping for Appearance and View-Invariant Global Localization

VISTA (View-Invariant Segmentation-Based Tracking for Frame Alignment): introduces a monocular global localization framework that combines an object-based segmentation and tracking pipeline with a submap correspondence search, including Image Input, Image Auto-segmentation, Video Tracking, Visual Inertial Odometry (VIO), Structure from Motion (SfM), Environment Map (Mi), Bounding Box Submap Generation, Geometric Data Association, and Relative Rotation and Translation Estimation.
The framework generates sparse, viewpoint-invariant 3D environment representations and aligns vehicle reference frames by exploiting geometric consistencies between environment maps.
It achieves robust localization across diverse camera viewpoints and seasonal changes without domain-specific training, maintaining a compact object-based map for real-time performance.

Tactical Decision for Multi-UGV Confrontation with a Vision-Language Model-Based Commander

Vision-Language Model-Based Commander: introduces a framework for multi-UGV confrontation, integrating a Perception VLM (scene understanding) and a Decision LLM (strategic planning), with an Expert System (training supervision) for semantic alignment.
The framework reconstructs perception-to-decision as a language-based cognitive process, achieving unified perception and decision within a shared semantic space.
This approach, validated through simulations, demonstrates strong adaptability, interpretability, and a win rate over 80% compared to baseline models.

14th July 2025

Logic-layer Prompt Control Injection (LPCI): A Novel Security Vulnerability Class in Agentic Systems

LPCI (Logic-layer Prompt Control Injection): introduces a novel security vulnerability class targeting LLM agent architecture, with Prompt Ingestion Layer (captures user inputs), Memory Context Handler (manages memory states), Logic Execution Engine (interprets prompts, executes logic), Tool/Plugin Interface (facilitates external actions), and Output Dispatcher (manages output delivery), exploiting persistent memory and logic execution layers.
LPCI attacks embed encoded, delayed, and conditionally triggered payloads in memory or vector stores, bypassing conventional input filters and triggering unauthorized behavior across sessions.
The paper demonstrates LPCI feasibility across multiple LLM platforms and proposes runtime security controls to mitigate these vulnerabilities.

Prompt-Informed Reinforcement Learning for Visual Coverage Path Planning

PIRL (Prompt-Informed Reinforcement Learning): introduces a novel approach for visual coverage path planning using a UAV, integrating an LLM (GPT-3.5) and a PPO RL policy with components including Current UAV State, Structured LLM Prompt with UAV State, LLM Recommendation for Next UAV State, PARE, Action-based Reward for PPO, PIRL-based Reward for PPO, PPO Action, PPO RL policy, and Next UAV State.
The framework leverages the LLM's zero-shot reasoning and in-context learning to dynamically shape the reward function for the PPO agent via the PARE module.
PIRL guides the RL agent's position and camera adjustments for optimal visual coverage by combining standard RL rewards with LLM-based semantic feedback.

Toward Real-World Table Agents: Capabilities, Workflows, and Design Principles for LLM-based Table Intelligence

LLM-based Table Agents: introduces a survey focusing on automating table-centric workflows by integrating preprocessing, reasoning, and domain adaptation, with Table Structure Understanding (Formatting tables), Table and Query Semantic Understanding (Handling noise and ambiguity), Table Retrieval and Compression (Compressing or selecting tables), Executable Reasoning with Traceability (Generating verifiable steps), and Cross-Domain Generalization (Adapting to new domains).
The paper identifies these five core capabilities as essential for LLM-based agents to handle real-world table tasks involving noise, structural heterogeneity, and semantic complexity.
The survey reviews current methodologies for these capabilities and outlines future research directions for developing more robust, efficient, and generalizable agents.

Cultural Bias in Large Language Models: Evaluating AI Agents through Moral Questionnaires

Cultural Moral Framework Evaluation: evaluates cultural bias in LLMs using the MFQ-2 across cultural contexts, employing cultural persona prompting and synthetic population generation, comparing results to human baseline data using analysis methods.
The study finds that current LLMs tend to homogenize moral diversity across cultures, failing to accurately represent nuanced, culturally-specific moral intuitions.
The findings highlight limitations in current AI alignment approaches and the use of LLMs as synthetic populations in social science research.

The Man Behind the Sound: Demystifying Audio Private Attribute Profiling via Multimodal Large Language Model Agents

Gifts (hybrid multi-agent framework): introduces a framework to profile sensitive personal attributes from audio data using its LLM agent (Guides ALM, scrutinizes, consolidates), ALM agent (Infers attributes, answers questions), Guidance (LLM instructs ALM), Inference (ALM infers attributes), Forensics (LLM questions, ALM answers), Scrutinization (LLM evaluates ALM inference), and Consolidation (LLM aggregates results) components.
The framework leverages the strengths of LLMs and Audio-Language Models (ALMs) through a multi-phase process to enhance attribute inference capabilities from audio.
Gifts significantly outperforms baseline approaches in inferring sensitive attributes from audio, highlighting a privacy risk and providing a framework for further research and defense strategies.

LLM-Guided Agentic Object Detection for Open-World Understanding

LAOD (LLM-Guided Agentic Object Detection): introduces an LLM-guided agentic object detection framework that autonomously generates scene-specific object names using an LLM (Large Language Model) from an input image, which are then passed as generated labels to an OVOD (Open-Vocabulary Object Detector) for object localization, producing detected objects.
This framework enables fully label-free, zero-shot detection, adapting its perception goals dynamically without manual prompt engineering or predefined vocabularies.
The method enhances autonomy and adaptability for open-world understanding by tightly coupling language-based reasoning with visual grounding.

Semantic Context for Tool Orchestration

SC (Semantic Context): introduces a novel approach for robust tool orchestration, leveraging descriptive tool information to enhance learning efficiency and adaptation in dynamic action spaces.
The paper theoretically and empirically validates SC's benefits through the SC-LinUCB algorithm and demonstrates its critical role in dynamic adaptation for LLMs.
Furthermore, the FiReAct pipeline, which utilizes SC for semantic filtering and LLM-based reasoning, enables practical tool orchestration at scale with over 10,000 tools.

Warehouse Spatial Question Answering with LLM Agent 1st Place Solution of the 9th AI City Challenge Track 3

LLM Agent System: introduces a data-efficient approach for warehouse spatial question answering, integrating a Spatial Reasoning LLM, Light-weight Perception Models, Spatial Calculation Functions, an API Tools Interface, Multi-turn Execution, a Rule-based Parser, and Structured Message History.
The system leverages a reasoning LLM (Gemini 2.5-Flash) with function-calling capabilities to conduct complex spatial reasoning and interact with various tools for object retrieval, counting, and distance estimation.
This approach achieved first place in the 2025 AI City Challenge Physical AI Spatial Intelligence Warehouse benchmark, demonstrating high accuracy and efficiency in complex indoor scenarios.

Exploring User Security and Privacy Attitudes and Concerns Toward the Use of General-Purpose LLM Chatbots for Mental Health

Harm-Reduction Framework (conceptual recommendations for LLM-enabled chatbots): introduces a conceptual framework to safeguard user mental health disclosures with general-purpose LLM-enabled chatbots, including contextual nudges & just-in-time warnings (Dynamic S&P responses), strong default protections and ephemeral storage (Default privacy settings), and targeted oversight and audits (Third-party data review), aiming to address user security and privacy concerns.
The paper identifies critical user misconceptions and a general lack of risk awareness regarding data handling, privacy, and regulatory protections when using LLMs for mental health support.
It highlights the concept of 'intangible vulnerability,' where emotional disclosures are undervalued compared to tangible data, necessitating architectural safeguards and legislative frameworks.

From Semantic Web and MAS to Agentic AI: A Unified Narrative of the Web of Agents

Functional Taxonomy for Web of Agents Architectures: introduces a comprehensive evolutionary overview of the Web of Agents, with Semantic Foundation (establishes shared understanding), Communication Paradigm (classifies message exchange style), Locus of Intelligence (identifies core reasoning location), and Discovery Mechanism (defines how agents find each other) components, providing a unified analytical lens for comparing agent architectures across generations.
This taxonomy reveals a fundamental paradigm shift in the 'locus of intelligence' from external data or platforms to being embedded within the agent's core LLM, enabling scalable and adaptive WoA systems.
The paper highlights that while new protocols like MCP and A2A are essential, they are insufficient for building a robust, open, and trustworthy ecosystem, mapping out a new agenda focused on socio-technical challenges like decentralized identity, economic models, security, and governance.

Game Theory Meets LLM and Agentic AI: Reimagining Cybersecurity for the Age of Intelligent Threats

LLM-based Multi-Agent Systems (MAS) for Cybersecurity: introduces a framework for designing adaptive cyber systems by integrating game theory with LLM-driven agentic AI, featuring Chain, Star, Parallel, Feedback, and Hybrid workflows, each composed of LLM Agents.
This framework leverages LLMs as reasoning engines and generative policy mechanisms to overcome limitations of classical game theory, enabling dynamic, context-aware interactions among agents.
MAS workflows enhance robustness and resilience in cybersecurity by supporting architectural redundancy, inter-agent verification, and adaptive learning in adversarial environments.

Architecting Human-AI Cocreation for Technical Services – Interaction Modes and Contingency Factors

Six-Mode Taxonomy of Human-Agent Collaboration: introduces a comprehensive framework for designing human-agent systems, detailing six distinct interaction modes: Human-Augmentation-Mode (HAM), Human-in-Command (HIC), Human-in-the-Process (HITP), Human-in-the-Loop (HITL), Human-on-the-Loop (HOTL), and Human-Out-of-the-Loop (HOOTL).
The framework maps these modes to a standard process flow, illustrating the division of labor between human and AI agents across tasks like data gathering, solution formulation, and approval.
It provides actionable design guidance by connecting each mode to key contingency factors such as task complexity, operational risk, and system reliability, aiding practitioners in navigating automation-control trade-offs.

Semantic Segmentation based Scene Understanding in Autonomous Vehicles

FPN-EfficientNet: introduces a novel compound model for semantic segmentation in autonomous vehicles, utilizing Feature Pyramid Networks with an EfficientNet backbone, evaluated on the BDD100k dataset.
The model employs an encoder-decoder structure, incorporating convolutional and pooling layers, batch normalization, and various activation functions to achieve pixel-level scene understanding.
Transfer learning is applied to leverage pre-trained knowledge, and the model's performance is optimized using specific loss functions within the Tensorflow/Keras environment.

AI-Powered Math Tutoring: Platform for Personalized and Adaptive Education

AI-Powered Math Tutoring Platform: introduces a novel multi-agent AI tutoring platform that combines adaptive and personalized feedback, structured course generation, and textbook knowledge retrieval to enable modular, tool-assisted learning processes.
This system utilizes a multi-agent architecture with a central Tutor Agent orchestrating interactions, supported by specialist agents for research, planning, and course creation, and a dual-memory framework for personalization.
It integrates Retrieval-Augmented Generation (GraphRAG) for contextual textbook knowledge and various tools like a symbolic solver and function plotter to facilitate deep understanding and independent problem-solving.

PRM-Free Security Alignment of Large Models via Red Teaming and Adversarial Training

PRM-Free Security Alignment Framework: introduces a novel approach for LLM security alignment, leveraging automated red teaming and adversarial training to achieve robust security guarantees without Process Reward Models.
This framework systematically identifies vulnerabilities through sophisticated attack strategies and enhances model robustness via targeted adversarial training, significantly reducing computational costs by 61% compared to PRM-based methods.
It incorporates transparent reporting and continuous audit mechanisms, democratizing access to robust security measures for resource-constrained organizations and providing a scalable foundation against evolving adversarial threats.

ExCyTIn-Bench: Evaluating LLM agents on Cyber Threat Investigation

ExCyTIn-Bench: introduces a benchmark for evaluating LLM agents on cyber threat investigation, featuring a Graph Builder (constructs threat graphs), a QA Generator (generates questions/answers), an ExCyTIn Playground (interactive environment), an LLM Agent (investigates cyber threats), a MySQL Environment (provides log data), and an LLM Evaluator (assesses agent performance).
The benchmark leverages real-world security logs from a controlled Azure tenant to build bipartite alert-entity graphs, enabling the automatic generation of diverse and explainable question-answer pairs for agent evaluation.
It provides a standardized interactive environment where LLM agents query a MySQL database to solve multi-hop investigation tasks, with fine-grained reward calculation for intermediate steps.

Open-Source LLMs Collaboration Beats Closed-Source LLMs: A Scalable Multi-Agent System

SMACS (Scalable Multi-Agent Collaboration System): introduces a scalable multi-agent collaboration system that leverages prior selection and posterior enhancement to enable open-source LLMs to outperform closed-source LLMs, with all Unified Question Bank (stores questions and LLM performance), LLM Bank (pool of heterogeneous LLMs), Pre-establish (evaluates LLMs on question bank), Retrieval-based Prior Selection (RPS) (selects top-k LLMs), Question Embedding (embeds input question), LLM Performance Matrix (stores LLM performance), Retrieved Similarity Vector (represents retrieved question similarity), Selected Top-k LLMs (output of RPS), Exploration-Exploitation-Driven Posterior Enhancement (EPE) (generates and selects responses), Prior Dropping (forms answer subsets), LLM Aggregator (synthesizes multiple responses), Similarity Score Computation (computes mean pairwise similarity), Perplexity Score Computation (computes perplexity), Hybrid Posterior Score (combines similarity and perplexity), where the framework integrates prior and posterior information to generate diverse, high-quality responses.
The system utilizes a Unified Question Bank and an LLM Bank, employing Retrieval-based Prior Selection to select optimal LLMs and Exploration-Exploitation-Driven Posterior Enhancement with an LLM Aggregator to refine responses.
This framework demonstrates scalability and superior performance across various benchmarks by effectively combining the strengths of multiple LLMs.

RCG: Safety-Critical Scenario Generation for Robust Autonomous Driving via Real-World Crash Grounding

RCG (Real-world Crash Grounding): introduces a scenario generation framework that integrates crash-informed semantics into adversarial perturbation pipelines, utilizing a Behavior Embedding Space, Encoder, Decoder, Fully-Connected (FC) Layer, LoRA Adapter, Reconstruction Loss, and Prototypical Contrastive Learning (PCL) Objective to create safety-critical scenarios.
The framework constructs a safety-aware behavior representation by pre-training on large-scale driving logs and fine-tuning on a crash-rich dataset, leveraging an Unsafe Embedding Cache, Trajectory Predictor, and k-NearestNeighbors (KNN) Distance for adversarial selection.
RCG guides the Adversary Agent's behavior via a Perturbation Function to maximize realistic criticality against the Ego Agent within a Base Scenario, leading to more plausible and effective stress testing for autonomous driving systems.

BRIDGING BRAINS AND MACHINES: A UNIFIED FRONTIER IN NEUROSCIENCE, ARTIFICIAL INTELLIGENCE, AND NEUROMORPHIC SYSTEMS

Unified Frontier: introduces a research paradigm bridging neuroscience, artificial intelligence, and neuromorphic computing, with four pillars: Co-Design of Brains, Algorithms and Hardware, Hybrid Learning Pipelines, Hierarchical Memory and Sensorimotor Grounding, and Standardization and Benchmarking.
The paper surveys foundational milestones, recent advances, and conceptual mismatches across these domains, highlighting cross-inspiration and convergence points like synaptic plasticity, sparse spike-based communication, and multimodal association.
It proposes an integrated roadmap outlining open challenges and future directions for biologically-grounded AGI and next-generation neuromorphic hardware, emphasizing energy efficiency, real-time adaptation, and ethical considerations.

Large Population Models

Large Population Models (LPMs): introduces a novel approach to simulate complex societal systems at scale, integrating Million-scale Agent-based Simulation (simulates millions agents), Differentiable Agent-based Simulation (enables gradient learning), and Decentralized Agent-based Simulation (securely deploys simulations) to overcome traditional agent-based model limitations.
This framework, implemented by AgentTorch, enables efficient simulation of millions of agents, end-to-end differentiable learning from diverse data streams, and privacy-preserving integration with real-world systems.
LPMs provide a robust platform for understanding collective intelligence, evaluating policies, and testing social innovations before real-world deployment, as demonstrated in a COVID-19 case study for New York City.

Multi-residual Mixture of Experts Learning for Cooperative Control in Multi-vehicle Systems

MRMEL (Multi-residual Mixture of Experts Learning): introduces a novel framework for Lagrangian traffic control, with vehicle observations and MDP context fed into a Policy F_e (actor/policy network) that outputs nominal weights for Nominal policies and a Residual action, which are combined to form the Final action, while a Value function (critic network) estimates Value to maximize Max E[R], all operating within Training threads involving Autonomous Vehicles and Human-driven vehicles in diverse Traffic scenarios.
MRMEL augments a suboptimal nominal AV control policy by learning a residual correction and dynamically selecting the most suitable nominal policy from a pool of nominal policies conditioned on the traffic scenarios, modeled as a mixture of experts.
The framework is designed for generalizable multi-vehicle control, demonstrating superior performance in cooperative eco-driving by reducing aggregate vehicle emissions across diverse real-world traffic scenarios.

13th July 2025

TinyTroupe: An LLM-powered Multiagent Persona Simulation Toolkit

TinyTroupe: introduces a simulation toolkit enabling detailed persona definitions and programmatic control via LLM-driven mechanisms, including Agents (LLM-powered entities), Environments (simulation context), Factories (generate agent specifications), Validators (assess agent quality), Propositions (define verifiable claims), Simulation Steering (guide simulation flow), Information Processing (extract/enrich/export data), Caching (preserve simulation state), Control (overall simulation management), Experimenter (user interaction), Simulation Core (central simulation engine), Data (data handling components), Action Generation (produce agent actions), Mental Faculties (agent cognitive abilities), and Tools (simulated agent tools).
The toolkit is designed for realistic human behavior simulation using LLM-powered multiagent systems with a focus on detailed persona specifications.
It provides a comprehensive set of utilities for specifying scenarios, running simulations, extracting data, and validating results, supporting an experiment-oriented workflow.

Negotiating Comfort: Simulating Personality-Driven LLM Agents in Shared Residential Social Networks

Personality-Driven LLM Agents Simulation: introduces a methodology integrating generative agents powered by LLMs into social network simulations, with Generative Agents (simulated entities), LLM (decision engine), Social Network (simulated relationships), Crowd framework (simulation platform), Environment (external factors), Agent Memory (stores agent data), Agent Reflection (processes past experiences), Agent Planning (determines future steps), Agent Actions (decisions and interactions), Family Members (within-family agents), and Family Representatives (building-level agents).
The approach simulates personality-driven decision-making regarding central heating temperature in a shared residential building.
The simulation uses the Crowd framework for execution and visualization, modeling agent interactions and decisions based on personality, preferences, and social ties.

THOR: Transformer Heuristics for On-Demand Retrieval An LLM Solution Enabling Conversation with Relational Databases by eSapiens

THOR (Transformer Heuristics for On-Demand Retrieval): introduces a multi-agent Text-to-SQL framework with a Supervisor Agent (Routes queries, interprets task), SQL Generation Agent (Converts NL to SQL), Self-correction Module (Regenerates SQL on failure), and Result Interpretation Agent (Analyzes data, generates insights).
The framework uses LLMs for SQL generation, self-correction, and result interpretation, operating on database schema and executing queries against a SaaS database.
A key feature is the self-correction loop, which retries SQL generation based on execution errors or low-quality results, enhancing robustness.

eSapiens: A Platform for Secure and Auditable Retrieval-Augmented Generation

eSapiens: introduces a platform with Knowledge Adaptation Layer (data processing), Storage Layer (data storage), Application Logic Layer (agent orchestration), DEREK Engine (unstructured data QA), and THOR Module (structured data QA).
It employs a multi-agent architecture orchestrated via LLM Frameworks like LangChain/LangGraph for retrieval-augmented generation and natural language analytics over diverse enterprise data.
The platform provides secure, auditable AI workflows, integrating data connectors, prompt management, and robust security features for enterprise use cases.

AICRYPTO: A COMPREHENSIVE BENCHMARK FOR EVALUATING CRYPTOGRAPHY CAPABILITIES OF LARGE LANGUAGE MODELS

Agent-based framework: introduces, with LLM Agent, Environment, Task Prompts, Response Format, Action Types, Available Tools, Execution Environment, Feedback, Helper Scripts, and Challenge Files, a system for evaluating LLMs on CTF challenges through iterative interaction.
The framework allows the LLM Agent to perform actions like executing commands or creating files within a controlled Execution Environment using Available Tools.
The Environment provides Feedback to the LLM Agent, enabling multi-step reasoning and problem-solving towards recovering the flag from Challenge Files, guided by Task Prompts and structured Response Format.

Evaluating LLMs on Sequential API Call Through Automated Test Generation

StateGen (Automated Test Case Generation Framework): introduces an automated framework to generate diverse coding tasks involving sequential API interactions, with Trace Generation, TraceGenerator, State Schema, API Compatibility Checking, Energy-based Sampling, Program Generation, Control Flow Injection, Instruction Translation, Multi-agent System, Generator Agent, Evaluator Agent, Oracle Generation, and Local Execution Environment, designed to evaluate LLMs' ability in understanding sequential API calls and managing associated program states.
The framework follows a reverse-generation strategy, starting with executable API sequences, adding control flow, and translating them into natural language instructions using a multi-agent system.
StateGen is used to construct StateEval, a benchmark of 120 verified test cases across three scenarios, highlighting areas for improvement in current LLMs incorporating APIs.

Towards Agentic RAG with Deep Reasoning: A Survey of RAG-Reasoning Systems in LLMs

RAG-Reasoning System: introduces a survey of systems integrating retrieval and reasoning in LLMs, including Reasoning-Enhanced RAG (Reasoning improves RAG stages), RAG-Enhanced Reasoning (Retrieval improves reasoning), and Synergized RAG-Reasoning (Iterative retrieval and reasoning) with various Reasoning Workflow (Structured reasoning process) and Agent Orchestration (How agents interact) strategies.
The survey categorizes approaches into three evolutionary stages, highlighting how reasoning can enhance RAG stages and how retrieval can enhance LLM reasoning.
Synergized RAG-Reasoning systems, particularly agentic ones, iteratively interleave retrieval and reasoning to achieve state-of-the-art performance on knowledge-intensive tasks.

IteraOptiRacing: A Unified Planning-Control Framework for Real-time Autonomous Racing for Iterative Optimal Performance

IteraOptiRacing (Unified Planning-Control Framework): introduces a unified planning-control strategy for autonomous racing, integrating Data Collection (gathers historical data), Online Optimization (real-time planning, control), Historical Data Set (stores past performance), Target Terminal Set Construction (generates future states), Surrounding Vehicle Perception (identifies dynamic obstacles), Affine Dynamics Model (approximates vehicle dynamics), Iterative LQR Solver (optimizes trajectories), Trajectory Selection (chooses optimal path), Vehicle Control (applies control inputs), and Vehicle Simulator (simulates racing environment).
The framework leverages iterative optimization based on historical data and an affine time-varying vehicle model to generate collision-free and time-optimal trajectories for the ego vehicle in dynamic multi-car racing environments.
This approach ensures smooth overtaking maneuvers and improved lap time performance by avoiding nonsmooth transitions between time-optimal and overtaking controllers, validated through high-fidelity simulations.

TruckV2X: A Truck-Centered Perception Dataset

TruckV2X: introduces a truck-centered cooperative perception dataset, featuring multi-modal sensing (LiDAR, cameras, IMU units) and multi-agent cooperation (tractor, trailer, CAV, RSU), generated in CARLA/Unreal Engine, and benchmarked using the OpenCOOD framework.
This dataset addresses the scarcity of heavy-duty vehicle data for cooperative perception, focusing on unique challenges like extensive blind spots and occlusions caused by large truck size and dynamic trailer movements.
The research establishes performance benchmarks for cooperative perception tasks, demonstrating the critical value of truck-specific viewpoints for enhanced occlusion handling and advancing autonomous trucking systems.

GenAI-based Multi-Agent Reinforcement Learning towards Distributed Agent Intelligence: A Generative-RL Agent Perspective

GenAI-MARL (Generative AI-based Multi-Agent Reinforcement Learning): introduces a paradigm shift for multi-agent systems by leveraging generative models for environment dynamics modeling, action policy modeling, and integrated prediction and planning, enabling proactive decision-making and sophisticated coordination.
This approach addresses limitations of conventional MARL by tackling the curse of dimensionality, non-stationarity, and partial observability through learning compact representations, anticipating policy evolution, and inferring hidden states.
The framework aims to foster distributed agent intelligence, enabling agents to synthesize realistic multi-agent scenarios, predict behaviors, and generate complex coordination strategies for enhanced collective performance.

12th July 2025

Knowledge Conceptualization Impacts RAG Efficacy

Agentic graph-RAG system: introduces an approach leveraging an Agentic Processor, LLM, Knowledge Graphs Pool, Schema Injection, and User Dialogue to generate SPARQL queries from natural language competency questions.
The system focuses on how injecting knowledge graph schemas into the LLM's context impacts its ability to generate semantically and syntactically correct queries.
The research evaluates the efficacy of this system by varying schema complexity and representation formats across different knowledge graphs.

When Developer Aid Becomes Security Debt: A Systematic Analysis of Insecure Behaviors in LLM Coding Agents

OpenHands Framework: introduces a systematic security analysis of LLM-based coding agents using the OpenHands Framework (AI agent platform) powered by various LLM Backends (Specific LLMs powering agent) on the SetupBench Benchmark (Software setup task benchmark), employing a Detection System (Prompt-based insecure action classifier) and evaluating Mitigation Strategies (Methods to reduce insecure behavior).
The paper evaluates the security posture of autonomous coding agents by analyzing over 12,000 actions across five state-of-the-art LLMs on 93 real-world software setup tasks.
Findings reveal significant security concerns, with 21% of agent trajectories containing insecure actions, and demonstrate varying effectiveness of mitigation strategies like feedback mechanisms and security reminders.

StockSim: A Dual-Mode Order-Level Simulator for Evaluating Multi-Agent LLMs in Financial Markets

STOCKSIM: introduces a dual-mode order-level simulator for evaluating multi-agent LLMs in financial markets, featuring an Exchange Simulation Engine, Data Sources, Agents (including LLM and specialist roles), and an Evaluator, communicating via RabbitMQ.
The simulator offers both detailed order-level and aggregated candlestick-level execution modes to capture realistic market dynamics for LLM evaluation.
The framework supports multi-agent LLM coordination, integrates external data, and provides tools for analyzing LLM trading behavior and performance.

Hide-and-Shill: A Reinforcement Learning Framework for Market Manipulation Detection in Symphony—a Decentralized Multi-Agent System

Hide-and-Shill: introduces, "a novel Multi-Agent Reinforcement Learning (MARL) framework for decentralized manipulation detection", with Shiller Agent (Generates manipulative discourse), Follower Agent (Simulates user engagement), Detector Agent (Identifies manipulative discourse), LLM Text Encoder (Extracts text features), GNN User Encoder (Processes user network), TCN Market Encoder (Processes market data), Multi-Modal Fusion Module (Combines features), State Representation (Comprehensive input vector), Action Space (Binary manipulation prediction), Reward Function (Market-grounded, attention-cost), Group Relative Policy Optimization (GRPO) (Optimizes detector policy), Manipulation Probability Prediction (Outputs prediction score), KOL Trust Accumulation Module (Stores detection results), TrustScore Calculation (Computes KOL trust), KOL Profile Updater (Updates KOL profiles), Multi-Agent Simulation Environment (Simulates interactions), Market Response Model (Simulates price changes), Real-World Data Integration (Calibrates simulation), Regulatory Sandbox Dynamic Thresholding (Application layer component), Symphony (Decentralized architecture), where the framework models manipulation detection as a dynamic adversarial game using MARL.
The framework integrates GRPO for stable learning in sparse reward environments and a theory-grounded reward function capturing the causal link between discourse and asset behavior.
The multi-modal agent pipeline fuses LLM-based semantic features, social graph signals, and on-chain market data for informed decision-making and is integrated within the Symphony decentralized system.

AInsight: Augmenting Expert Decision-Making with On-the-Fly Insights Grounded in Historical Data

AInsight: introduces a system for augmenting expert decision-making, with Interactive UI (Displays information), Conversation Processing Pipeline (Processes conversation), Audio Transcription Module (Transcribes audio), Information Extraction Module (LLM agent extracts elements), Insight Generation Module (LLM agent generates insights), Retrieval Module (Retrieves data), Knowledge Base (Stores historical data), Vector Database (Stores embedded data), and Embedding Model (Embeds text), designed to provide on-the-fly insights grounded in historical data during synchronous conversations.
The system continuously monitors conversations, extracts key information, retrieves relevant data from a knowledge base, and generates concise insights presented via a conversational user interface.
Leveraging a retrieval-augmented generation pipeline built around LLM agents and a vector database, AInsight aims to improve expert decisions in high-stakes domains like healthcare by making historical data accessible in real-time.

Learning from Synthetic Labs: Language Models as Auction Participants

Synthetic Lab Framework: introduces a novel synthetic data-generating process using LLM Agents (simulated bidders) within a Simulated Auction Environment (various formats), driven by a Simulation Procedure (multi-round process) and a Prompting System (rules, history, interventions), with Data Collection (bids, outcomes, profits) for analysis.
The framework simulates various auction formats, including sealed-bid, clock, and eBay-style auctions, allowing LLM agents to participate as bidders.
The simulation procedure incorporates a "plan-bid-reflect" loop and uses structured prompting to guide LLM agent behavior and collect experimental data.

Emergence of Hierarchical Emotion Organization in Large Language Models

Emotion Tree Construction Algorithm: introduces a novel method to uncover hierarchical emotion organization in LLMs by analyzing probabilistic dependencies between emotional states in model outputs.
This algorithm utilizes GPT-40 for scenario generation, Llama models for emotion recognition, and a matching matrix to infer emotion trees, revealing how LLMs organize emotions hierarchically.
The research also investigates LLM biases in emotion recognition across diverse demographic personas, finding alignment with human systematic biases.

11th July 2025

Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next Generation Agentic Capabilities.

Gemini 2.X model family: introduces a new generation of natively multimodal LLMs, including Gemini 2.5 Pro and Flash, with advanced reasoning, multimodality, long context, and next-generation agentic capabilities, built on sparse Mixture-of-Experts (MoE) transformers and featuring an inference-time "Thinking" capability.
The paper details the architecture of an AI Agent system built on these models, comprising an Agentic Core, Persistent Memory & Context, Game I/O, and Agentic Tools, demonstrated through its application in playing Pokémon Blue.
The research also evaluates the models' safety and security, including their resilience against indirect prompt injection attacks from external services and attackers, and their performance across various coding, reasoning, and multimodal benchmarks.

elsciRL: Integrating Language Solutions into Reinforcement Learning Problem Settings

elsciRL: introduces an open-source Python library for applying language solutions to reinforcement learning problems, including Config, Data Engine, Adapter (Language Adapter), Observation Samples, Extra Graphs, LLM Language State Generator, LLM Planner, LLM Validation, LLM Reflection, Encoders, Analysis, Instruction Following, User Interface (GUI), RL Agents, Environment Interaction, Evaluation, Experiment, and Results components.
The framework extends the LASIF methodology by integrating LLMs for language state generation, planning, validation, and reflection, and provides a GUI for user interaction.
It aims to facilitate the evaluation of language solutions on reward-based environments with minimal setup, demonstrating potential performance improvements for RL agents using LLM-based instruction following.

Introspection of Thought Helps AI Agents

INoT (Introspection of Thought): introduces a novel AI Agent Reasoning Framework that uses PromptCode, an LLM-read code within the prompt, to enable LLMs to execute programmatic dialogue reasoning processes internally.
The framework transfers the self-denial and reflection process from outside the LLM to inside, reducing token cost and improving performance on various tasks.
INoT's prompt is structured in XML and includes modules for PromptCode definition, image augmentation (for MLLMs), and a reasoning module that simulates a multi-agent debate internally using virtual agents.

Agentic Large Language Models for Conceptual Systems Engineering and Design

MAS: introduces a structured multi-agent system for conceptual engineering design, with Extractor Agent, Supervisor Agent, Generator Agent, Coder Agent, Reflector Agent, Ranker Agent, Meta-Reviewer Agent, Orchestrator Agent, Worker Agent, and Design-State Graph (DSG), enabling automated requirements decomposition, subsystem mapping, and runnable physics model generation.
The system utilizes a JSON-serializable Design-State Graph (DSG) to represent the evolving design knowledge, bundling requirements, physical embodiments, and numerical models.
The MAS workflow follows a structured progression of agents, with optional research loops managed by the Orchestrator and Worker agents.

AGENTSNET: Coordination and Collaborative Reasoning in Multi-Agent LLMs

AGENTSNET: introduces a multi-agent benchmark, with Agents (LLMs instantiated as nodes), Network Topology (communication graph connecting agents), Message-Passing Protocol (synchronous neighbor-to-neighbor communication), Tasks (distributed computing problems), and Evaluation (metrics for task completion), designed to measure coordination and collaborative reasoning in multi-agent LLM systems.
The benchmark uses fundamental distributed computing problems like Coloring, Vertex Cover, Matching, Leader Election, and Consensus as tasks for the agent network.
Agents communicate via a synchronous message-passing protocol over various graph topologies to solve these collaborative problems.

Emergent Natural Language with Communication Games for Improving Image Captioning Capabilities without Additional Data

LOGIC (Lewis Communication Game for Image Captioning): introduces a multi-agent reinforcement learning game with a Speaker (Generates message/caption) and a Listener (Identifies image from distractors) to learn unsupervised image captioning.
The Speaker Model (M1) uses a Vision Encoder (Processes input image for Speaker) and Language Decoder (Generates natural language message), while the Listener Model (M2) uses a Vision Encoder (Processes images for Listener), Language Encoder (Processes message for Listener), and Decoder (Outputs probability distribution over images).
The framework trains these agents in a cooperative common-reward setting using a policy gradient algorithm to emerge a communication strategy for image captioning.

Unlocking Speech Instruction Data Potential with Query Rewriting

Query Rewriting Framework with Multi-LLM Knowledge Fusion: introduces a method to construct high-quality speech instruction datasets using Query Rewriting LLMs (rewrite text instructions), Speech Style LLM (generate speech style descriptions), TTS Model (synthesize speech), Multi-agent Annotation/Validation Module (evaluate synthesized speech quality), and Knowledge Fusion Module (correct failed rewrites).
The framework leverages multiple LLMs for rewriting and knowledge fusion, multiple ASR and embedding models for multi-agent validation, and a TTS model for speech synthesis.
This approach enables automated dataset construction by transforming text instructions for better TTS compatibility and validating synthesized speech quality without human annotation.

To Trade or Not to Trade: An Agentic Approach to Estimating Market Risk Improves Trading Decisions

Agentic Approach: introduces an agentic system using LLMs to iteratively discover stochastic differential equations for financial time series and inform daily trading decisions.
The system includes risk analyst agents for model discovery and risk metric generation, and trader agents that use these metrics along with news context.
Evaluation shows that model-informed trading strategies outperform standard LLM-based agents, improving Sharpe ratios.

Finding Common Ground: Using Large Language Models to Detect Agreement in Multi-Agent Decision Conferences

Simulated Decision Conference System: introduces a multi-agent system simulating decision conferences, including moderator (guides process), participants (debate, provide perspectives), and a judge agent (detects agreement) for agreement detection.
The system utilizes LLM agents for each role, enabling structured debate, perspective sharing, and automated agreement detection among participants.
The judge agent's performance in detecting agreement is evaluated using objective benchmarks and subjective LLM-as-a-judge methods.

A Survey of Large Language Models in Discipline-specific Research: Challenges, Methods and Opportunities

LLM Techniques Taxonomy: introduces a classification of methods for applying LLMs in discipline-specific research, including Continued Pre-training (Deepen domain expertise), Supervised Fine-tuning (Adapt to specific tasks), Reinforcement Learning from Human Feedback (Align with human preferences), Prompt Engineering (Guide model responses), Retrieval-Augmented Generation (Integrate external knowledge), Agent-based Methods (Interact with environment), and Tool-use Integration (Use external tools).
The survey categorizes techniques into Internal Knowledge Optimisation and External Interaction and Collaboration to address domain-specific challenges and enhance LLM performance.
It examines the application of these techniques across various scientific and humanities disciplines, highlighting potential and challenges.

Multi-Agent LLMs as Ethics Advocates in AI-Based Systems

MALEA (Multi-Agent LLM Ethics-Advocate framework): introduces, "a framework for generating ethics requirements drafts", with Requirements Engineer Agent (generates/refines requirements), Quality Inspector Agent (assesses requirement quality), Ethics Advocate Agent (critiques ethical issues), and Documentation Agent (prints final requirements), where "the framework leverages multi-agent LLMs to elicit and refine ethics requirements".
The framework operates through iterative feedback loops between the LLM-based agents to improve the quality and ethical considerations of the generated requirements.
This multi-agent approach aims to automate the initial drafting of ethics requirements to support their early integration into the software development process.

Exploring Design of Multi-Agent LLM Dialogues for Research Ideation

Structured Ideation-Critique-Revision Framework: introduces a multi-agent LLM dialogue system for research ideation, including LLM Agent (Ideator/Proposer) (Generates initial ideas), LLM Agent (Critic) (Critiques generated ideas), and LLM Agent (Reviser) (Revises ideas based on critiques).
The framework operates as an iterative cycle where LLM agents generate, critique, and refine research ideas based on seed topics and retrieved papers.
The study empirically evaluates how varying agent diversity, parallelism, and interaction depth within this framework impacts the novelty and feasibility of generated ideas.

What Factors Affect LLMs and RLLMs in Financial Question Answering?

Evaluation Framework: introduces an investigation into factors affecting LLMs and RLLMs in financial question answering, utilizing Prompting Methods, Agentic Frameworks, Multilingual Alignment Methods, LLMs, RLLMs, Long CoT, FAMMA, and Basic Txt dataset.
The study evaluates the impact of various methods and frameworks on five LLMs and three RLLMs using the FAMMA benchmark.
Findings indicate that methods effective for LLMs often simulate Long CoT, while RLLMs' inherent Long CoT capabilities limit further enhancement from conventional methods.

CRMAgent: A Multi-Agent LLM System for E-Commerce CRM Message Template Generation

CRMAgent: introduces a multi-agent LLM system for e-commerce CRM message template generation, with ContentAgent (diagnoses templates), RetrievalAgent (retrieves exemplars), TemplateAgent (rewrites templates), and EvaluateAgent (assesses quality).
The system improves underperforming CRM messages using historical data, high-quality examples, and a rule-based fallback.
CRMAgent integrates content diagnosis, retrieval-based adaptation, and rule-based generation strategies across its specialized agents.

Agent Safety Alignment via Reinforcement Learning

Unified Safety-Alignment Framework: introduces a method to train LLM agents with Tools, using a Sandbox Environment and Reinforcement Learning guided by a Taxonomy and Reward Function to create a Policy-driven Decision Model.
The framework addresses both user-initiated and tool-initiated threats by classifying inputs and outputs into benign, malicious, or sensitive categories.
Training in a sandboxed environment with calibrated rewards enables agents to execute benign tasks, refuse malicious inputs, and seek verification for sensitive actions, balancing safety and utility.

Infinite Video Understanding

Infinite Video Understanding Vision: introduces a research objective for models to continuously process, understand, and reason about video data of arbitrary duration, with all Encoder (Processes incoming video), Persistent Memory System (Stores long-term knowledge), Memory Consolidation (Updates persistent memory), Query-Aligned Retrieval (Accesses relevant memory), Streaming/Incremental Processing (Handles continuous data flow), Hierarchical/Adaptive Representations (Multi-resolution data encoding), Event-Centric Understanding (Focuses on events/relationships), Agentic Reasoning (LLM plans and uses tools), and Multimodal Processing (Integrates diverse data types) components.
The vision necessitates fundamental innovation in system architecture, memory management, data representation, processing paradigms, and evaluation methodologies.
Achieving this capability requires overcoming challenges like context window limitations, memory burdens, information loss, and maintaining temporal coherence over vast scales.

SetupBench: Assessing Software Engineering Agents' Ability to Bootstrap Development Environments

SetupBench: introduces, "SetupBench: Assessing Software Engineering Agents' Ability to Bootstrap Development Environments", with SetupBench (Benchmark tasks), Agent (LLM software engineering agent), Bare Linux Sandbox (Execution environment), Evaluation Harness (Automated validation), Docker Image (Task container), Validation Command (Success verification), where the paper presents a benchmark for evaluating LLM agents' ability to set up software development environments.
The benchmark includes 93 tasks across four categories, each providing a natural-language problem statement, workspace snapshot, and deterministic validation command.
Agents are evaluated in fresh, minimal Linux containers using an automated harness that runs the agent and verifies task completion via the validation command.

How to Train a Leader: Hierarchical Reasoning in Multi-Agent LLMs

MLPO (Multi-agent guided Leader Policy Optimization): introduces a hierarchical multi-agent framework with a trained Leader LLM, an Agent Team (untrained off-the-shelf LLMs), Task Input, Agent Responses, Leader Output, Feedback, Final Answer, SFT (leader pre-training phase), and MLPO (leader training objective), where a single trained leader coordinates untrained agents for collaborative reasoning.
The leader processes Task Input and Agent Responses, generating Leader Output (reasoning and answer) which serves as Feedback for subsequent rounds, ultimately producing the Final Answer.
The framework trains only the leader using SFT and the MLPO objective, enabling it to effectively evaluate and synthesize agent contributions and also perform well independently.

SIMAGENTS: Bridging Literature and the Universe Via A Multi-Agent Large Language Model System

SIMAGENTS (Bridging Literature and the Universe Via A Multi-Agent Large Language Model System): introduces a multi-agent system with Parameter Extraction (extracts simulation parameters), Physics Agent (interprets papers domain knowledge), Software Agent (enforces software constraints), Post-Simulation Processing (generates analysis code), and Analysis Code Writer (generates analysis scripts) components.
The system automates cosmological simulation parameter configuration from literature and preliminary analysis using specialized LLM agents.
SIMAGENTS agents collaborate through structured communication to ensure extracted parameters are physically meaningful, consistent, and software-compliant.

Role-Playing LLM-Based Multi-Agent Support Framework for Detecting and Addressing Family Communication Bias

Role-Playing LLM-Based Multi-Agent Dialogue Support Framework: introduces a multi-stage, multi-agent LLM system that analyzes parent-child dialogues to detect suppressed emotion and ideal parent bias, then generates empathetic and actionable feedback.
The framework utilizes a Dialogue D (Input dialogue) processed by a Suppressed Emotion Detection Agent (Asup), Auxiliary Attribute Estimation Agent (Aattr), and Ideal Parent Bias Detection Agent (Abias), with outputs integrated by a Meta-Agent (Ameta) into Child Report (Rchild) and Adult Report (Radult).
Selected Expert Agents (Eselect), chosen from an Expert Agents Pool (E) using BERT (Calculates embedding similarity), collaboratively generate feedback through a four-step discussion, which is then synthesized by a Final Meta-Agent (Afinal) into Final Feedback for Child (Ffinal,child) and Final Feedback for Adult (Ffinal,adult) to support positive family communication.

Optimizing Sequential Multi-Step Tasks with Parallel LLM Agents

M1-Parallel: introduces a framework that concurrently runs multiple multi-agent teams in parallel to uncover distinct solution paths, leveraging an event-driven communication model with asynchronous messaging to reduce end-to-end latency or boost task completion rates.
The framework includes a Centralized Manager, a Plan Generation Function, multiple Multi-agent Teams (each comprising an Orchestrator and specialized agents like WebSurfer, FileSurfer, Coder, and ComputerTerminal), a Global Memory Module, and an Aggregator.
M1-Parallel operates in either an Early-stop mode, terminating when the fastest team completes, or an Aggregation mode, combining answers from multiple teams to improve task completion.

ARPACCINO: An Agentic-RAG for Policy as Code Compliance

ARPACCINO: introduces an agentic system for Policy as Code (PaC) compliance, integrating an LLM Engine (core reasoning engine), RAG Tool (accesses domain knowledge), Terraform Tool (pre-processes IaC), Rego Rules Checker Tool (verifies policy rules), Policy Validation Tool (assesses IaC compliance), and Persistent Knowledge (stores domain data).
This system automates the generation and verification of PaC rules from natural language descriptions, iteratively refining Infrastructure as Code (IaC) configurations for conformance.
By combining LLMs, Retrieval-Augmented-Generation, and specialized tools, the system enhances automation, reliability, and accessibility of PaC workflows, even with smaller LLMs.

An Offline Mobile Conversational Agent for Mental Health Support: Learning from Emotional Dialogues and Psychological Texts with Student-Centered Evaluation

EmoSApp (Emotional Support App): introduces an offline, smartphone-based conversational AI app for mental health support, leveraging a fine-tuned LLaMA-3.2-1B-Instruct model, Torchtune for optimization, Executorch for on-device inference, and a domain specialization approach combining knowledge and conversational datasets.
The system addresses limitations of existing solutions by enabling entirely offline operation, enhancing data privacy, and delivering responsive performance on resource-constrained mobile devices through LLM quantization.
Qualitative and quantitative evaluations demonstrate the app's ability to provide coherent, empathetic, and contextually appropriate mental health support, serving as a blueprint for portable AI-driven solutions.

Behavioral Exploration: Learning to Explore via In-Context Adaptation

BE (Behavioral Exploration): introduces a novel approach for training autonomous agents, utilizing a long-context policy (generates expert actions), history (past observations context), state (current environment state), action (agent's output action), future trajectory (predicted future path), coverage (exploratory behavior measure), behavioral cloning loss (mimics expert actions), expert demonstration data (offline training dataset), coverage conditioning value (regulates exploration degree), diffusion model, transformer backbone, state token, coverage-to-go token, history state tokens, and task label (optional task conditioning).
This framework enables agents to learn data-driven exploratory behavioral policies that adapt quickly online, restricting exploration to coherent, reasonable behaviors derived from expert demonstrations.
The approach leverages offline training and in-context online adaptation, demonstrating effectiveness in simulated locomotion, manipulation, and real-world robotic tasks.

ACCELERATING DRUG DISCOVERY THROUGH AGENTIC AI: A MULTI-AGENT APPROACH TO LABORATORY AUTOMATION IN THE DMTA CYCLE

Tippy (a novel agentic AI framework): introduces a multi-agent system for laboratory automation in drug discovery, featuring Supervisor, Molecule, Lab, Analysis, Report, and Safety Guardrail agents, designed to accelerate DMTA cycles.
This framework leverages autonomous AI agents that reason, plan, and collaborate, integrating with laboratory infrastructure via the Model Control Protocol, LIMS, ELN, and analytical instrument data systems.
The system demonstrates significant improvements in workflow efficiency, decision-making speed, and cross-disciplinary coordination, providing a new paradigm for AI-assisted drug discovery.

SPLASH! Sample-efficient Preference-based inverse reinforcement learning for Long-horizon Adversarial tasks from Suboptimal Hierarchical demonstrations

SPLASH (Sample-efficient Preference-based inverse reinforcement learning for Long-horizon Adversarial tasks from Suboptimal Hierarchical demonstrations): introduces a novel IRL algorithm for learning from suboptimal demonstrations in long-horizon and adversarial settings, incorporating options-level demonstrations, behavioral cloning, downsampled full trajectory pairs, success and progress-based learning constraints, and temporal consistency regularization.
The framework empirically validates its approach on a maritime capture-the-flag task in simulation and demonstrates real-world applicability with sim-to-real translation experiments on autonomous unmanned surface vehicles.
It significantly outperforms state-of-the-art methods in reward learning from suboptimal demonstrations by addressing challenges specific to long-horizon and adversarial tasks.

OnlineBEV: Recurrent Temporal Fusion in Bird's Eye View Representations for Multi-Camera 3D Perception

OnlineBEV (Recurrent Temporal Fusion in Bird's Eye View Representations for Multi-Camera 3D Perception): introduces a novel recurrent temporal fusion framework for multi-camera 3D perception, utilizing a Motion-Guided BEV Fusion Network (MBFNet) for feature alignment and Heatmap-Based Temporal Consistency Learning (HTC-loss) for explicit supervision.
The framework effectively aggregates BEV features across frames using a recurrent design, compensating for spatial misalignment caused by object motion through spatio-temporal deformable attention.
This approach achieves state-of-the-art performance in camera-only 3D object detection, BEV segmentation, and 3D occupancy prediction on the nuScenes benchmark.

10th July 2025

The Flaws of Others: An LLM-driven Framework for Scientific Knowledge Production

FOO (Flaws-of-Others): introduces an LLM-driven framework for scientific knowledge production, with User Task (Initial request), Agents (LLMs) (Multiple LLMs), Initial Answers (First responses), Critiques (Peer evaluations), Harmoniser(s) (Aggregating critiques), Judgement (Synthesized feedback), Revised Answers (Updated responses), and Convergence Test (Stopping condition).
The framework models invalidation propagation in a discursive network of LLM agents and humans, defining invalidation as any factual, logical, or structural breach.
The FOO algorithm operationalizes cross-network detection by having agents critique each other's outputs iteratively, aiming to reduce the prevalence of false statements.

PyVision: Agentic Vision with Dynamic Tooling

PyVision: introduces an agentic framework enabling MLLMs to dynamically generate and execute Python code for multimodal reasoning, featuring an MLLM, a Runtime Environment, a Multi-turn Interaction Loop, and Dynamically Generated Tools.
The framework operates as a multi-turn loop where the MLLM generates code executed in an isolated Python runtime, with the results fed back to the MLLM's context for iterative refinement.
This approach allows the model to create task-specific tools on the fly, leveraging Python libraries for flexible, grounded, and interpretable visual reasoning.

MIRIX: Multi-Agent Memory System for LLM-Based Agents

MIRIX: introduces a modular, multi-agent memory system for LLM-based agents, with Core Memory (User info, context), Episodic Memory (Events, experiences), Semantic Memory (Concepts, entities), Procedural Memory (Guides, workflows), Resource Memory (Files, documents), Knowledge Vault (Sensitive information), Meta Memory Manager (Coordinates memory managers), Memory Managers (Manage specific memory types), and Chat Agent (User interaction, query processing), designed to enable LLMs to remember diverse, long-term user data at scale.
The system employs a multi-agent architecture with specialized memory components and managers to handle heterogeneous information and facilitate effective retrieval and updates.
MIRIX demonstrates improved accuracy and storage efficiency on multimodal and long-form conversational benchmarks compared to existing memory systems.

Agentic Retrieval of Topics and Insights from Earnings Calls

Agentic Framework: introduces an LLM-driven system to dynamically retrieve and organize financial topics from earnings calls, with Earnings Call Documents (Input data), Topic Retriever (Extracts topics and excerpts), Extracted Topics & Excerpts (Output of retriever), Ontologist Sub-Agent (Validates and integrates topics), Novelty Verification (Checks if topic exists), and Ontology Data Structure (Stores topics hierarchically).
The framework uses a Topic Retriever LLM to extract relevant topics and contextual excerpts from text.
An Ontologist Sub-Agent LLM manages a continuously evolving hierarchical topic ontology by verifying novelty and integrating new or updated topics.

Automating MD simulations for Proteins using Large language Models: NAMD-Agent

NAMD-Agent: introduces an automated pipeline, with LLM Agent (orchestrates workflow), Retrieval-Augmented Generation (RAG) (code retrieval), Curated Codebase (automation scripts), PDBFixer (structure preprocessing), CHARMM-GUI (system setup tool), Selenium (web automation), NAMD (simulation engine), and Post-processing Tools (analysis, visualization), that leverages LLMs, python scripting, and web automation to streamline MD input file generation and simulation.
The system uses a ReAct-based agent powered by Gemini-2.0-Flash and LlamaIndex to interpret user queries, generate and execute code, and interact with external tools like CHARMM-GUI and NAMD.
The RAG framework enhances the LLM's ability to generate accurate automation scripts by retrieving relevant code templates and API patterns from a curated repository.

DocCHA: Towards LLM-Augmented Interactive Online diagnosis System

DocCHA: introduces a modular, confidence-aware framework emulating clinical reasoning with LLMs, including Symptom Collection Module (elicits symptoms, guides questioning), History Acquisition Module (collects history, controls depth), and Causal Graph Construction and Refinement Module (constructs causal graph, refines reasoning).
The framework decomposes the diagnostic process into three sequential stages, each powered by an LLM backend and guided by interpretable confidence scores.
Each module uses confidence metrics to guide adaptive questioning, prioritize information, and refine reasoning links for structured and transparent diagnosis.

Position: We Need An Algorithmic Understanding of Generative AI

AlgEval: introduces a framework for systematically researching the algorithms LLMs learn and use, utilizing LLM system, algorithmic primitives, algorithmic grammars, algorithms, algorithm identification and interpretability methods, empirical verification and theoretical analysis, and improved design and insights.
The framework aims to uncover algorithmic primitives and their composition by analyzing latent representations, attention, and inference-time compute.
A case study on graph navigation demonstrates applying attention and representation analysis to evaluate hypothesized search algorithms like BFS and DFS.

Hallucination Stations: On Some Basic Limitations of Transformer-Based Language Models

LLM Analysis Framework: introduces, with Large Language Model (core model), Input String (prompt/data), Output String (generated response), Tokens (text units), and LLM-based Agent (system using LLM), an analysis showing that LLMs are limited by their computational complexity.
The paper argues that LLMs cannot correctly perform or verify tasks whose complexity exceeds the LLM's core operation complexity of O(N².d).
This limitation implies that LLMs and LLM-based agents will hallucinate when faced with computationally complex tasks like matrix multiplication or verifying TSP solutions.

StarDojo: Benchmarking Open-Ended Behaviors of Agentic Multimodal LLMs in Production-Living Simulations with Stardew Valley

StarDojo: introduces a novel environment and benchmark based on Stardew Valley for evaluating agentic MLLMs in production-living simulations, featuring StarDojo Environment (Simulation platform), StarDojoMod (Game engine interface), Python Wrapper (Agent interaction layer), Observation Space (Visual and textual data), Action Space (Agent command set), Task Evaluator (Progress monitoring), and Simulator APIs (Environment configuration).
The platform provides a unified interface, automated evaluation, system compatibility, and parallelized environments to facilitate research on agents capable of complex decision-making.
StarDojo offers a comprehensive observation space combining visual screenshots and structured textual information, and an abstracted action space to enable robust agent interaction and evaluation.

SAND: Boosting LLM Agents with Self-Taught Action Deliberation

SAND (Self-taught Action Deliberation): introduces a self-learning framework to equip LLM agents with explicit action deliberation, utilizing a Base LLM (initial model), Trainable LLM (agent being finetuned), Expert Trajectories (initial successful data), Self-Consistency Action Sampling (samples candidate actions), Execution-Guided Action Critique (generates critiques from rollouts), Action Deliberation Synthesis (creates deliberation thoughts), Deliberation Trajectories (self-augmented training data), Iterative Finetuning (repeated model updates), and an Inconsistency Indicator (flags deliberation need).
The framework iteratively generates deliberation thoughts by sampling candidate actions, critiquing their rollouts, and synthesizing reasoning using the base LLM.
The self-augmented deliberation trajectories are then used to finetune the LLM agent, teaching it when and what to deliberate for improved decision making.

DrugMCTS: a drug repurposing framework combining multi-agent, RAG and Monte Carlo Tree Search

DrugMCTS: introduces a drug repurposing framework with Retrieval Agent (Identifies relevant molecules/proteins), Molecule-Analysis Agent (Evaluates molecule properties), Molecule-Selection Agent (Filters candidate molecules), Interaction-Analysis Agent (Interprets drug-target interactions), Decision Agent (Integrates evidence, selects protein), Monte Carlo Tree Search (Guides iterative search/decision), and LLM (Qwen2.5-7B-Instruct) (Performs reasoning/analysis tasks).
The framework integrates RAG, multi-agent collaboration, and MCTS to enable structured and iterative reasoning for drug-target interaction prediction.
It leverages a data processing pipeline to transform scientific data into formats more interpretable by LLMs and uses a reward calculation mechanism within MCTS for feedback-driven search.

KVFlow: Efficient Prefix Caching for Accelerating LLM-Based Multi-Agent Workflows

KVFlow: introduces a workflow-aware KV cache management framework for LLM-based agentic workflows, utilizing an Agent Step Graph, Steps-to-Execution, Workflow-Aware Eviction Policy, Overlapped KV Prefetching, and Status-Aware Scheduling to improve efficiency.
The framework abstracts agent execution as an Agent Step Graph to compute steps-to-execution values, guiding a fine-grained eviction policy at the KV node level.
It incorporates a fully overlapped KV prefetching mechanism combined with status-aware scheduling to proactively load required tensors and avoid cache miss stalls.

Effect of Static vs. Conversational Al-Generated Messages on Colorectal Cancer Screening Intent: a Randomized Controlled Trial

Single AI Message Approach: introduces a randomized controlled trial comparing a single LLM-generated message and an LLM chatbot conversation, both tailored to demographics, against expert materials and a no-message control for increasing colorectal cancer screening intent.
The study found that both AI interventions significantly increased stool test intent compared to expert materials and control, but neither improved colonoscopy intent over expert materials.
A concise, demographically tailored single AI message was as effective as a longer, interactive AI chatbot conversation for boosting stool test intent, suggesting scalability benefits for simpler AI messaging.

Reasoning and Behavioral Equilibria in LLM-Nash Games: From Mindsets to Actions

LLM-Nash framework: introduces a game-theoretic model where agents use LLMs guided by reasoning prompts to make decisions, defining equilibrium over the prompt space which induces behavioral outcomes.
This framework explicitly models the reasoning process, capturing bounded rationality and enabling analysis of cognitive constraints and mindset expressiveness.
Unlike classical games, LLM-Nash games define equilibrium at the reasoning level, where agents optimize prompts to maximize expected utility via LLM-generated actions.

A Dynamic Stackelberg Game Framework for Agentic AI Defense Against LLM Jailbreaking

Purple Agent: introduces a dynamic Stackelberg game framework for LLM jailbreaking defense, featuring the Purple Agent (Agentic AI defender) interacting with the LLM (Target LLM), using DeployDefense (Deploys defenses) and SimulateRedExpansion (Simulates attacker exploration) guided by an Internal Defense Policy (Guides defense strategy) and RRT (Exploration algorithm).
The framework models LLM jailbreaking as a sequential extensive-form game between an attacker and the defender (Purple Agent).
The Purple Agent proactively anticipates potential adversarial paths by simulating attacker behavior using RRT-based exploration and deploys targeted interventions.

KP-A: A Unified Network Knowledge Plane for Catalyzing Agentic Network Intelligence

KP-A (A Unified Network Knowledge Plane for Agentic Network Intelligence): introduces a layered architecture with a Network Knowledge Plane positioned between the Network Data Ontology Plane and the Network Intelligence Plane, including Intelligence Agents, Utility Agents, Knowledge Query Tool / Model Context Protocol Server, Live Network Data Endpoints, Static Data Explanation Endpoints, UE, Cell, BaseStation, RIC, CoreNetwork, EdgeServer, Base Stations, Edge Server, Core Network, Cells, and User Equipments.
The Network Knowledge Plane serves as a unified source of truth, providing intuitive and consistent access to dynamic and static network knowledge for LLM-powered agents.
The architecture decouples knowledge acquisition from consumption, enabling reusable knowledge access for diverse network intelligence tasks.

MCPmed: A Call for MCP-Enabled Bioinformatics Web Services for LLM-Driven Discovery

MCPmed: introduces a layered architecture for bioinformatics web services, with UI, API layer, and MCP layer components, enabling LLM-driven discovery.
The MCP layer provides a standardized, machine-actionable interface over existing APIs, associating endpoints with scientific concepts and metadata for LLMs.
Breadcrumbs offer a transition mechanism for legacy services, while LLM agent researchers leverage the MCP layer for autonomous data exploration and analysis using components defined by types.Tool.

9th July 2025

The User-Centric Geo-Experience: An LLM-Powered Framework for Enhanced Planning, Navigation, and Dynamic Adaptation

Agent-based Travel Smart Assistant: introduces an LLM-powered framework with Travel Planning Agent (plans trips, explores areas), Destination Assistant Agent (navigates final leg), and Local Discovery Agent (adapts, finds alternatives).
The framework integrates planning, navigation, and dynamic adaptation using cooperative agents and multimodal LLMs to address gaps in traditional systems.
This system enhances user experience by handling complex queries, providing precision navigation, and adapting to real-world disruptions.

Exploring LLMs for Predicting Tutor Strategy and Student Outcomes in Dialogues

Llama 3.2 3B (fine-tuned with LoRA): introduces, "Exploring LLMs for Predicting Tutor Strategy and Student Outcomes in Dialogues", with Llama 3.2 3B model (base LLM), LoRA (fine-tuning method), Dialogue history (text input), Previous turn move labels (optional input), Predicted tutor move (output), and Predicted student outcome (output), evaluating LLMs and baselines on predicting tutor moves and student outcomes in tutoring dialogues.
The study compares fine-tuned Llama 3.2 3B and zero-shot GPT-40 LLMs against traditional Markov Chain, Logistic Regression, and LSTM baselines.
Experiments on MathDial and AlgebraNation datasets show LLMs outperform baselines, but predicting future tutor strategy remains challenging, while student outcome prediction is more tractable.

The Dark Side of LLMs: Agent-based Attacks for Complete Computer Takeover

Agentic AI Systems: introduces an evaluation of LLM agents as attack vectors for computer takeover by exploiting trust boundaries in agentic AI systems, including LLM (core engine), Agent (autonomous entity), Perception (processes inputs), Storage (memory/knowledge), Planning/Reasoning (decides actions), Actions (executes tasks), Tools (external capabilities), Retrieval (searches knowledge), Knowledge (external database), Multi-Agent System (interacting agents), and Inter-Agent Communication (agent interaction).
The paper demonstrates three attack surfaces: direct prompt injection, RAG backdoor attacks, and inter-agent trust exploitation, showing that LLMs can be coerced into installing and executing malware.
A vulnerability hierarchy is established, revealing that inter-agent trust exploitation is the most effective attack vector, often bypassing defenses against direct prompts or RAG attacks.

SkyVLN: Vision-and-Language Navigation and NMPC Control for UAVs in Urban Environments

SkyVLN: introduces a framework integrating vision-and-language navigation with Nonlinear Model Predictive Control for UAVs, featuring Multimodal Perception (Processes visual and linguistic inputs), Visual Observations (RGB, depth, semantic images), Visual Foundation Model (Detects visual landmarks), LLM (Sub-goal Extraction) (Interprets instructions, extracts sub-goals), Wayfinding Prompt Optimization (WPO) (Refines localization, adds spatial/historical context), High-resolution Spatial Descriptor (HSD) (Describes landmark spatial relationships), TrackBack Memory Array (TBMA) (Stores historical path/instructions), Action Decision Module (Generates control commands), Nonlinear Model Predictive Control (NMPC) (Handles trajectory tracking, obstacle avoidance), Airsim Attitude Controller (Translates NMPC to motor commands), and LLM Motion Generator (Outputs thoughts and actions).
The framework leverages LLMs to interpret natural language instructions and visual observations, enabling navigation in dynamic 3D urban spaces with improved accuracy and robustness.
Key components like the spatial verbalizer and history path memory enhance the UAV's ability to handle ambiguous instructions and complex spatial reasoning tasks.

The Flaws of Others: An LLM-driven Framework for Scientific Knowledge Production

FOO (Flaws-of-Others): introduces an LLM-driven framework for scientific knowledge production, with User Task (Initial request), Agents (LLMs) (Multiple LLMs), Initial Answers (First responses), Critiques (Peer evaluations), Harmoniser(s) (Aggregating critiques), Judgement (Synthesized feedback), Revised Answers (Updated responses), and Convergence Test (Stopping condition).
The framework models invalidation propagation in a discursive network of LLM agents and humans, defining invalidation as any factual, logical, or structural breach.
The FOO algorithm operationalizes cross-network detection by having agents critique each other's outputs iteratively, aiming to reduce the prevalence of false statements.

InvestAlign: Overcoming Data Scarcity in Aligning Large Language Models with Investor Decision-Making Processes under Herd Behavior

InvestAlign: introduces a framework that constructs high-quality SFT datasets by leveraging theoretical solutions to similar and simple optimal investment problems, with components: Theoretical solution (Mathematical solution), Simple problem (Simplified investment scenario), Complex problem (Original investment scenario), Training dataset (Data generated from theoretical solution), Pre-SFT LLM (Base LLM before fine-tuning), and InvestAgent (LLM fine-tuned with generated data).
This approach addresses data scarcity for aligning LLMs with investor decision-making processes under herd behavior.
Training LLMs on the generated data achieves faster parameter convergence and closer alignment to real-user data than using real-user data directly.

Gradientsys: A Multi-Agent LLM Scheduler with ReAct Orchestration

Gradientsys: introduces a multi-agent scheduling framework, with a Constellation LLM Scheduler (orchestrates agents), Tool Registry (stores tool information), Specialized AI Agents (perform specific tasks), Model-Context Protocol (standard tool interface), ReAct Reasoning Engine (LLM planning loop), Observability Module (streams traces), Hybrid Sync/Async Execution (manages parallel calls), Scratchpad (stores reasoning steps), and Info Cache (caches intermediate info), designed to coordinate diverse specialized AI agents for complex tasks.
It leverages an LLM-powered scheduler using ReAct for dynamic planning and supports parallel execution of heterogeneous agents via a standardized MCP interface.
The framework includes a robust retry-and-replan mechanism and streams real-time agent activity and reasoning for transparency.

Foundation Model Self-Play: Open-Ended Strategy Innovation via Foundation Models

FMSP (Foundation-Model Self-Play): introduces a new paradigm combining self-play with foundation models, including Foundation Model (Generates/improves policies), Policy (Code-based strategy), Agent (Embodies policy), Archive (Stores policies), Competition (Agents interact), Context (FM input), Evaluation (Measures performance/diversity), and Sandbox (Safe code execution), to enable open-ended strategy discovery in multi-agent games.
The framework leverages LLMs' code generation and knowledge to create diverse and high-quality policies, overcoming limitations of traditional self-play like local optima and lack of diversity.
FMSP variants like QDSP demonstrate superior performance and diversity in tasks like Car Tag and LLM red teaming by balancing exploration and exploitation through FM-powered search and archiving.

Multi-Agent Retrieval-Augmented Framework for Evidence Based Counterspeech Against Health Misinformation

MA (Multi-Agent Retrieval-Augmented Framework): introduces a system for generating evidence-based counterspeech against health misinformation, utilizing a Misinformation Post (Input), Static Retrieval Agent (Gathers static evidence), Dynamic Retrieval Agent (Fetches real-time evidence), Retrieve Knowledge Base (Local static data source), DuckDuckGo Web Search (Real-time dynamic source), Combined Retrieved Evidence (Merged static/dynamic evidence), Summarization Agent (Filters and condenses evidence), Filter Summarized Evidence (Processed evidence output), Top-Ranked Evidence (Selected relevant evidence), Counterspeech Generation Agent (Creates initial response), Raw Counterspeech Response (Initial generated response), Refinement Agent (Improves generated response), and Refined Response (Final polished output).
The framework employs multiple specialized LLM agents in sequence to retrieve, filter, summarize, generate, and refine responses.
Integrating both static and dynamic knowledge sources enhances the relevance, informativeness, and factual accuracy of the generated counterspeech.

ViDove: A Translation Agent System with Multimodal Context and Memory-Augmented Reasoning

ViDove: introduces a multimodal translation agent system that integrates visual, audio, and textual inputs with a memory system and specialized LLM-based agents (Vision, Auditory, Translation, Proofreader, Editor) to enhance translation quality for long-form video.
The system leverages a memory system comprising short-term and long-term modules to provide context-aware translation and adapt to domain-specific knowledge.
A multi-agent post-editing module with Proofreader and Editor agents refines the initial translation through collaborative review and user instructions.

Application of LLMs to Multi-Robot Path Planning and Task Allocation

Approach: introduces a method for expert exploration in multi-agent reinforcement learning, utilizing an Ensemble of Mixer Networks for Uncertainty Estimation to decide whether to query the MARL Algorithm (QMIX) policy or an Expert Planner (A* or LLM (Vicuna-7B)), which involves a Planning Prompt Generator, Tokenizer, and Action Transformer, with collected Data Collection used for Batch Creation and Model Update Module.
The system integrates an LLM (Vicuna-7B) as an expert planner to guide exploration for a QMIX-based multi-agent system in grid environments when the agent's intrinsic uncertainty is high.
Experiments compare the performance of QMIX with RNN, QMIX with Attention, QMIX with Attention using A* as expert, and QMIX with Attention using Vicuna-7B as expert, showing improved performance with expert guidance, particularly from the LLM.

Open Source Planning & Control System with Language Agents for Autonomous Scientific Discovery

cmbagent: introduces a multi-agent system for autonomous scientific discovery, with Planner (proposes plan), Plan Reviewer (provides plan feedback), Controller (orchestrates plan execution), Engineer (handles coding tasks), Researcher (performs reasoning tasks), Executor (executes code locally), Post-execution Interpreter (decides next agent), Installer (installs missing packages), Terminator (ends session), Context Agents (specialized with extended context), and RAG Agents (retrieve information using RAG), implementing a Planning & Control strategy for end-to-end task execution without human intervention.
The system leverages approximately 30 LLM agents, each specializing in different tasks, orchestrated via a robotics-inspired Planning & Control workflow built upon the AG2 framework.
Key features include specialized agents for research papers and code libraries, feedback loops between agents, structured output generation, and local code execution capabilities.

Evaluating Retrieval-Augmented Generation Agents for Autonomous Scientific Discovery in Astrophysics

SciRag: introduces a modular framework for evaluating Retrieval-Augmented Generation (RAG) agents, with document preprocessing (processes research papers), retrieval (finds relevant document chunks), and generation (uses LLMs to generate answers) components.
The framework utilizes the CosmoPaperQA benchmark dataset and a dual human expert and LLM-as-a-Judge evaluation approach to assess RAG agent performance in astrophysics.
Different RAG configurations, varying LLMs, embedding models, and retrieval strategies, are systematically compared for accuracy and cost-efficiency on expert-curated questions.

8th July 2025

SciMaster: Towards General-Purpose Scientific AI Agents Part I. X-Master as Foundation — Can We Lead on Humanity's Last Exam?

X-Masters (Scattered-and-Stacked Agentic Workflow): introduces a workflow that orchestrates multiple X-Master agents in specialized roles, including Solver, Critic, Rewriter, and Selector, to systematically enhance reasoning breadth and depth.
This framework leverages individual X-Master agents, which are tool-augmented reasoning agents driven by an LLM, using code as an interaction language to flexibly interact with external tools.
The X-Master agent's core mechanism involves generating Python code for a Code Executor to access Tools like Web Search, Web Parse, and Python Libraries, with execution results appended back to the agent's context for iterative reasoning.

Representing Prompting Patterns with PDL: Compliance Agent Case Study

PDL (Prompt Declaration Language): introduces a novel declarative YAML-based language for specifying LLM prompts and workflows, with PDL Language (Declarative YAML syntax), PDL Interpreter (Executes PDL programs), Blocks (Program units), Context (Implicit message history), Tool Definitions (External function wrappers), Model Calls (LLM interactions), Parser (Output processing), Type System (JSON Schema validation), and Control Structures (Flow logic).
The language captures the composition of LLM calls, rule-based code, and external tools, abstracting away plumbing for improved productivity and optimization.
A case study demonstrates PDL's utility in a compliance agent, showing performance improvements by enabling customization of prompting patterns and agent architecture.

Bridging AI and Software Security: A Comparative Vulnerability Assessment of LLM Agent Deployment Paradigms

Function Calling: and Model Context Protocol (MCP): introduces a comparative vulnerability assessment of LLM agent deployment paradigms, evaluating Function Calling and MCP architectures using a unified threat framework and attack progression model.
The study reveals that architectural choices fundamentally reshape threat landscapes, with Function Calling showing higher system-centric vulnerabilities and MCP exhibiting increased LLM-centric exposure.
Analysis across simple, composed, and chained attacks demonstrates that attack complexity dramatically amplifies effectiveness, highlighting the critical impact of architectural critical paths on vulnerability exposure.

Too Human to Model: The Uncanny Valley of LLMs in Social Simulation When Generative Language Agents Misalign with Modelling Principles

No explicit framework name is provided: The paper describes a thought experiment on building an LLM-driven Bass diffusion model, outlining conceptual components including LLM agents with personalized prompts, a memory system, a conversation mechanism, a decision-making process, and a potential auxiliary cognitive system.
The thought experiment reveals five dilemmas arising from the mismatch between LLMs' natural language realism and the abstraction required for social simulation modelling.
The authors argue that LLM agents are better suited for social simulation purposes like situated role play and social learning rather than prediction or explanation focused on system-level emergence.

OPENAGENTSAFETY: A Comprehensive Framework for Evaluating Real-World AI Agent Safety

OPENAGENTSAFETY (OA-SAFETY): introduces a comprehensive framework for evaluating AI agent safety in realistic scenarios, including an LLM Agent operating within a Docker Container with access to Real Tools and a Messaging Tool, interacting with a User and NPCs, guided by a Task Definition, and evaluated by a Rule-based Evaluator and an LLM-as-Judge, built upon OpenHands and Sotopia.
The framework supports over 350 multi-turn, multi-user tasks simulating diverse user intents and social dynamics across eight critical safety risk categories.
A hybrid evaluation approach combines rule-based checks for concrete environmental changes with LLM-as-Judge assessments to capture subtle unsafe behaviors and reasoning.

Conditional Multi-Stage Failure Recovery for Embodied Agents

CMFR (Conditional Multi-stage Failure Recovery): introduces a framework for embodied agents using zero-shot chain prompting, with Planning (Generates initial plan), Execution (Executes subgoals), Object Search (Finds target objects), Scene Representation (Stores visual information), and CMFR (Handles execution failures) structured into CMFR Stage 1 (Checks subgoal importance), CMFR Stage 2 (Checks preconditions), CMFR Stage 3 (Finds workarounds), and CMFR Stage 4 (Post-execution reflection) components.
The approach leverages large language models' reasoning abilities to analyze execution challenges and devise strategic solutions within the environmental context.
The multi-stage recovery process operates conditionally, with the first three stages addressing subgoal failures during execution and the final stage functioning as a post-execution reflection phase.

Multi-Agent Debate Strategies to Enhance Requirements Engineering with Large Language Models

MAD (Multi-Agent Debate): introduces a debate-based system for requirements classification with an LLM Coordinator, Functional Debater, Non-Functional Debater, and Judge Agent.
The system involves debaters presenting arguments and a judge making a final classification decision based on the debate.
Empirical evaluation shows MAD strategies improve classification accuracy compared to a single agent baseline, albeit at a higher computational cost.

Constella: Supporting Storywriters' Interconnected Character Creation through LLM-based Multi-Agents

Constella (LLM-based Multi-Agents): introduces, "Constella supports storywriters' interconnected character creation with FRIENDS DISCOVERY (generates related characters), JOURNALS (produces diary entries), COMMENTS (enables character responses), Character Profiles (store character details), Relationship Attributes (define character connections), and Stateless Generation (lacks persistent memory), where Constella is an LLM-based multi-agent tool designed to help writers create and manage character casts and their relationships.
The tool leverages a social media metaphor to provide intuitive interactions for expanding character ensembles, exploring inner thoughts, and manifesting relationships.
Constella's design, including deliberate constraints and intermediary outputs, aims to preserve authorial agency while supporting creative exploration.

Large Language Models for Agent-Based Modelling: Current and possible uses across the modelling cycle

LLMs for Agent-Based Modelling: introduces the potential uses of Large Language Models (Assist ABM cycle phases), Agent-Based Modelling Cycle (Framework for social simulation), and LLM-Powered Agents (Agents using LLMs for decisions) across the ABM cycle.
The paper surveys current uses and reflects on opportunities and challenges of integrating LLMs into ABM.
LLMs can assist in various ABM phases, from problem formulation and system analysis to implementation, verification, validation, interpretation, and documentation, and can power agents directly.

ECom-Bench: Can LLM Agent Resolve Real-World E-commerce Customer Support Issues?

ECom-Bench: introduces a benchmark framework for evaluating multimodal LLM agents in e-commerce customer support, including an LLM Agent (customer service representative), User Simulator (persona-driven customer), Persona Dataset (user personality/behavior data), Task Dataset (realistic e-commerce tasks), Database (structured e-commerce data), Tools (agent functions/APIs), Domain Documentation (operational guidelines/world model), and Evaluation Module (performance assessment).
The framework utilizes persona-driven user simulation based on real customer interactions and a realistic task dataset derived from authentic e-commerce dialogues to provide a comprehensive evaluation platform.
ECom-Bench evaluates agent capabilities across various business scenarios, including multimodal interactions, using a defined set of tools implemented with a Model Context Protocol.

LLMs are Introvert

SIP-enhanced cognitive architecture: introduces a framework for LLM agents to improve social intelligence, incorporating Social Cognitive Memory (Stores generalized social knowledge), Social Behavior Memory (Records specific past interactions), Social Interaction Memory (Temporary buffer immediate stimuli), Observation (Selectively attends social cues), Planning (Prioritizes establishes goals), and Execution (Selects assesses responses).
This architecture integrates memory models and a decision-making procedure based on the Social Information Processing (SIP) theory to enable more human-like social cognition in LLM agents.
Experimental results show that the SIP-enhanced agents exhibit improved performance in social simulations, demonstrating better alignment with human social behavior.

How Not to Detect Prompt Injections with an LLM

KAD (known-answer detection): introduces a framework for detecting prompt injections using a Detection LLM (classifies input contamination), Detection Instruction (prompts detection LLM), Secret Key (known expected output), and Detection Rule (determines contamination), where the Backend LLM (performs target task) receives Contaminated Data (input with injected task) crafted by a DataFlip (adaptive attack strategy).
The paper identifies a structural vulnerability in KAD where the Detection LLM can be coerced by an adaptive attack like DataFlip to reveal the Secret Key despite the input being contaminated.
DataFlip exploits this flaw to evade KAD detection while simultaneously inducing the Backend LLM to execute the injected task.

AI Agent Smart Contract Exploit Generation

A1 (Agentic Exploit Generation System): introduces an agentic system that transforms LLMs into end-to-end exploit generators, including an LLM Agent (Autonomously decides tool usage), Source Code Fetcher Tool (Retrieves contract source code), Constructor Parameter Tool (Extracts constructor parameters), State Reader Tool (Queries contract state), Code Sanitizer Tool (Removes non-essential code), Concrete Execution Tool (Validates exploit strategies), and Revenue Normalizer Tool (Converts token values).
The system leverages six domain-specific tools and concrete execution feedback to enable autonomous vulnerability discovery and exploit generation.
A1 generates profitable Proof-of-Concepts by understanding smart contract behavior, generating strategies, testing on blockchain states, and refining approaches based on execution outcomes.

GAF-GUARD: AN AGENTIC FRAMEWORK FOR RISK MANAGEMENT AND GOVERNANCE IN LARGE LANGUAGE MODELS

GAF-Guard (Governance Agentic Framework): introduces an agentic framework for LLM governance, with User, REST API/CLI, LLM models, Orchestrator, Memory management, CoT questionnaire, Risk generator, Human-in-the-Loop (HITL), Risk assessment, Drift monitor, Incident reporting, Guardrails, Security, Policy, Risks, Metrics, State, Memory, and Function call for LLM outputs components, designed to detect and monitor risks associated with LLM deployment based on use-case and user preferences.
The framework employs autonomous agents orchestrated to identify risks, activate detection tools, facilitate continuous monitoring, and report incidents within specific LLM use-cases.
It supports pre-deployment risk assessment via questionnaires, post-deployment real-time monitoring for drift and security threats, and automated incident reporting, incorporating human-in-the-loop feedback.

7th July 2025

Trojan Horse Prompting: Jailbreaking Conversational Multimodal Models by Forging Assistant Message

Trojan Horse Prompting: introduces a novel jailbreak technique by forging the model's own past utterances within the conversational history, bypassing safety mechanisms.
This attack exploits the Asymmetric Safety Alignment Hypothesis, where models implicitly trust their own purported conversational history, making them vulnerable to malicious payloads attributed to the model role.
The technique involves injecting a forged model message containing harmful instructions, followed by a benign user prompt, to trigger the generation of policy-violating content from the target conversational multimodal model.

Evolutionary and Coevolutionary Multi-Agent Design Choices and Dynamics

Agent Training System Components: introduces a system for training cyber agents using evolutionary and coevolutionary algorithms with different controller representations in the CybORG simulation environment, potentially incorporating an LLM for mutation, where agents compete against adversary agents.
The system evaluates combinations of algorithms (GA, ES, GE, GE-LLM) and representations (Action Selection Matrix, Context Free Grammar) under one-sided evolution and two-sided coevolution dynamics.
Performance is measured by agent fitness (reward) within the CybORG environment, comparing the effectiveness of different algorithmic and representational choices.

Conversational Education at Scale: A Multi-LLM Agent Workflow for Procedural Learning and Pedagogic Quality Assessment

WikiHowAgent: introduces a multi-LLM agent workflow for procedural learning and pedagogic quality assessment, with Teacher Agent (Provides instructions, answers questions), Learner Agent (Simulates understanding, generates responses), Interaction Manager (Manages conversation flow, progress), Evaluator (Assesses conversation quality), Memory (Stores conversation state, history), Tutorial (Instructional content input), Conversational Graph (Structures conversation turns), Evaluation Metrics (Measures conversation quality), and Human Judges (Provide human evaluation).
The workflow simulates interactive teaching-learning conversations using LLM-powered agents and assesses pedagogic quality through diverse metrics and human judgment alignment.
WikiHowAgent leverages large-scale tutorial content to enable dynamic teaching-learning simulations and provides a comprehensive evaluation protocol for LLMs in educational contexts.

Empowering Healthcare Practitioners with Language Models: Structuring Speech Transcripts in Two Real-World Clinical Applications

LLM-based Clinical Information Extraction Approach: introduces methods for nurse observation and medical order extraction using LLMs, evaluated on new SYNUR and SIMORD datasets.
The nurse observation extraction method includes segmentation, RAG filtering using flowsheet schema, and LLM-based extraction.
The medical order extraction method utilizes LLMs with specific prompts to extract structured orders from doctor-patient conversations.

Spatio-Temporal LLM: Reasoning about Environments and Actions

ST-LLM (Spatio-Temporal LLM): introduces a spatio-temporal LLM for reasoning about environments and actions, incorporating a Vision Encoder, Point Cloud Encoder, Cross Modality Alignment Module with Learnable Queries, Positional Encoding, Image Projector, Query Projector, and LLM Decoder.
The framework fuses egocentric video features with 3D scene representations via a cross-modal alignment module and 3D positional encoding.
This approach improves spatio-temporal understanding by linking temporal observations with global spatial context for tasks like embodied AI.

Evaluating Memory in LLM Agents via Incremental Multi-Turn Interactions

MemoryAgentBench: introduces a benchmark for evaluating LLM agents' memory capabilities across four core competencies: accurate retrieval, test-time learning, long-range understanding, and conflict resolution.
The benchmark combines restructured existing datasets with newly constructed ones to provide a systematic and challenging testbed for assessing memory quality.
Empirical results show that current memory agents fall short of mastering all four competencies, highlighting the need for further research into comprehensive memory mechanisms.

StreamVLN: Streaming Vision-and-Language Navigation via SlowFast Context Modeling

StreamVLN: a streaming vision-and-language navigation framework that employs a hybrid slow-fast context modeling strategy to support multi-modal reasoning over interleaved vision, language, and action inputs, utilizing a sliding-window dialogue context and a slow-updating memory context.
The framework addresses challenges in long-horizon context management and computational efficiency by using a sliding-window KV cache for responsive action decoding and voxel-based spatial pruning with temporal sampling for memory compression.
StreamVLN achieves coherent multi-turn dialogue and efficient KV cache reuse, enabling it to support long video streams with bounded context size and inference cost, demonstrating state-of-the-art performance with stable low latency.

CREW-WILDFIRE: Benchmarking Agentic Multi-Agent Collaborations at Scale

CREW-WILDFIRE: introduces an open-source benchmark for evaluating LLM-based multi-agent systems, with Procedurally Generated Environment (Creates scenarios), LLM-Ready Agentic Framework (Supports LLM agents), Agent State (Agent's internal state), Observations (Sensory input), Perception Module (Processes observations), Extracted Information (Perception output), Communication Framework (Manages communication), Messages (Agent exchanges), Chat History (Communication history), Action (Agent's intent), Execution Module (Translates actions), Primitive Library (Executable actions), Memory (Stores past data), and Heterogeneous Agents (Diverse roles/abilities).
The benchmark provides a realistic, scalable, and complex environment for evaluating agentic AI frameworks in wildfire response scenarios.
CREW-WILDFIRE supports both low-level control and high-level natural language interactions through its modular Perception and Execution modules.

From Autonomy to Agency: Agentic Vehicles for Human-Centered Mobility Systems

Agentic Vehicles (AgVs): introduces a systems-level framework for intelligent vehicles, with a multi-layered architecture including Perception and sensing layer (Environmental data acquisition, mapping), Cognitive layer (Planning, prediction, ethical reasoning), Interaction layer (Natural language, multi-modal exchanges), Execution layer (Low-level vehicle control), Tool interface layer (Integrates APIs, infrastructure, services), Memory modules (Maintain context across interactions), and Reflection modules (Refine behaviors over time).
The framework distinguishes AgVs from traditional autonomous vehicles by emphasizing agency, goal adaptability, dialogic interaction, and tool invocation.
This conceptual shift is enabled by technologies like Generative AI, LLMs, and Reinforcement Learning, moving towards vehicles as collaborative agents in mobility ecosystems.

MARBLE: A Multi-Agent Rule-Based LLM Reasoning Engine for Accident Severity Prediction

MARBLE (A Multi-Agent Rule-Based LLM Reasoning Engine): introduces a multi-agent hybrid reasoning system for accident severity prediction, with Core-Agent (Orchestrates agent interactions), ML Agent (Provides baseline ML prediction), Domain-Specific Agents (Perform domain-specific SLM reasoning), Prediction System (Synthesizes agent outputs), Rule-Based Coordination (Deterministic output aggregation), LLM-Based Coordination (SLM-guided output aggregation), and Final Decision Selection Logic (Integrates ML and coordinator outputs).
The framework decomposes the prediction task across specialized agents using small language models and a machine learning model, coordinated by a central agent.
MARBLE achieves high accuracy and interpretability by leveraging domain-specific reasoning and structured coordination mechanisms.

FurniMAS: Language-Guided Furniture Decoration using Multi-Agent System

FurniMAS: introduces a multi-agent system for language-guided furniture decoration, including System Admin, Asset Selector, Asset Validator, Stylist, Style Validator, Planner, Plan Validator, Arranger, and Retriever agents.
The system processes user text prompts and furniture surface details to select, style, plan, and arrange assets for a final decorative outcome.
FurniMAS employs a hybrid team of LLM-based and non-LLM agents that collaborate through communication and validation across multiple stages.

LLM-based Question-Answer Framework for Sensor-driven HVAC System Interaction

JARVIS: introduces a two-stage LLM-based QA framework for sensor-driven HVAC interaction, with Expert-LLM (Interprets query, plans response), Agent (Executes instructions, processes data), Query Builder/Executor (Generates, runs SQL queries), Data Processor (Processes retrieved sensor data), Response Generator LLM (Generates final natural language response), and Time-series Database (Stores HVAC sensor data) components.
The Expert-LLM translates user queries into structured instructions, which the Agent uses to retrieve and process data from the Time-series Database via the Query Builder/Executor and Data Processor.
The Agent then employs a Response Generator LLM to produce the final natural language answer based on the processed data and Expert-LLM's guidance.

Who's the Mole? Modeling and Detecting Intention-Hiding Malicious Agents in LLM-Based Multi-Agent Systems

AGENTXPOSED: introduces a psychologically grounded detection framework for intention-hiding malicious agents in LLM-based multi-agent systems, with Establish Baseline (Profiles agent personality), Detect Signal (Monitors behavioral deviations), Follow & Verification (Conducts targeted inquiries), HEXACO (Personality model), and Reid Technique (Interrogation technique) components.
The framework integrates personality profiling and behavioral monitoring across three sequential stages to identify covert adversaries.
AGENTXPOSED demonstrates superior detection performance compared to other personality models and baseline attacks, particularly in layered communication structures.

UrbanMind: Towards Urban General Intelligence via Tool-Enhanced Retrieval-Augmented Generation and Multilevel Optimization

UrbanMind: introduces a tool-enhanced retrieval-augmented generation framework with Database Layer (stores urban data, tools), Retrieval Layer (extracts relevant information), Integration Layer (fuses retrieved knowledge), Adaptation Layer (updates model parameters), Knowledge Base (dynamic urban data repository), Tool Set (multi-domain functions), LLM (core language model), Continual Learning Module (incremental adaptation), Memory Management Module (coordinates retrieval, adaptation), Cloud-Edge Architecture (distributed deployment), Cloud Layer (central orchestration), Edge Layer (localized processing), and Adapters (lightweight fine-tuning models), designed to facilitate urban general intelligence in dynamic environments.
The framework leverages a multilevel optimization paradigm to jointly address continual retrieval, knowledge integration, and model adaptation.
UrbanMind supports flexible deployment via a Cloud-Edge architecture, enabling efficient computation and real-time responsiveness.

MindFlow: Revolutionizing E-commerce Customer Support with Multimodal LLM Agents

MindFlow: introduces a multimodal LLM agent framework for e-commerce customer support, processing Query (Input) through a Decision-Making Module (Generates, evaluates, selects plans), Memory Module (Stores context, knowledge), and Action Module (Executes internal, external actions) to generate a Response (Output), enhanced by MLLM-as-Tool (Treats MLLMs as tools) and Agent-Computer Interface (ACI) (Simplifies complex inputs).
The framework integrates memory, decision-making, and action modules for real-time, context-aware reasoning in complex multimodal scenarios.
The modular MLLM-as-Tool strategy treats MLLMs as specialized visual processing tools, improving visual-textual reasoning efficiency and robustness.

OASBuilder: Generating OpenAPI Specifications from Online API Documentation with Large Language Models

OASBuilder: introduces a novel framework for automating the generation of OpenAPI Specifications from online API documentation webpages, utilizing Scraping/Segmentation, Demonstrative Documentation Extraction, Descriptive Documentation Extraction, Demonstrative OAS Generation, Descriptive OAS Generation, OAS Merging, OAS Enhancement, and a UI.
The framework employs a multi-stage pipeline integrating LLMs and rule-based algorithms to process diverse and unstructured HTML documentation into structured OAS format.
OASBuilder generates initial partial OAS from both demonstrative examples and descriptive text, merges them, and then enhances the resulting specification using AI-powered tools for metadata enrichment.

Towards Solving More Challenging IMO Problems via Decoupled Reasoning and Proving

DRP-IMO: introduces a novel framework for automated theorem proving, with a Reasoner (LLM) generating strategic subgoal lemmas, a Lemma Extraction Module extracting formal statements, a Subgoal Verification Prover (ATP Model) verifying these lemmas, and a Final Prover (LLM) constructing the final proof using verified lemmas.
This framework decouples high-level reasoning from low-level proof generation, addressing the gap between LLMs' informal reasoning and formal proving capabilities.
The modular design allows specialized models to excel at their respective tasks, enhancing problem-solving on complex mathematical challenges like IMO problems.

6th July 2025

R1-RE: Cross-Domain Relationship Extraction with RLVR

R1-RE (Reinforcement Learning with Verifiable Reward): introduces a framework for cross-domain relationship extraction, utilizing an LLM, Prompt, Reinforcement Learning (RLVR/GRPO), Reward Function, Annotation Guide, Sentence, and Generated Output to align LLM reasoning with human annotation.
The framework reframes relationship extraction as a reasoning task guided by annotation guidelines, improving out-of-domain robustness compared to traditional supervised fine-tuning.
A multi-stage, rule-based reward design incentivizes accurate predictions and adherence to the required output format, promoting step-by-step reasoning in the LLM.

MOMENTS: A Comprehensive Multimodal Benchmark for Theory of Mind

MOMENTS (Multimodal Mental States): introduces a comprehensive benchmark using Assigned Short Films and Assigned ToM Abilities, created via a pipeline involving Question Annotator, Distractor Annotator, Question Reviewer, Distractor Reviewer, and an LLM Copilot to produce the Annotated ToM Dataset for evaluating multimodal LLMs on Theory of Mind.
The benchmark features over 2,300 multiple-choice questions derived from realistic, long-form videos, assessing seven distinct ToM abilities.
An LLM-in-the-loop annotation framework is employed to generate challenging distractors and mitigate answer set biases observed in prior datasets.

WebSynthesis: World-Model-Guided MCTS for Efficient WebUI-Trajectory Synthesis

WebSynthesis: introduces a novel framework integrating a World Model with WebMCTS to synthesize web UI trajectories offline, utilizing a Policy Agent, World Model, Process Reward Model, and WebMCTS.
The framework leverages the World Model to simulate virtual web environments, enabling efficient and reversible tree-based planning guided by WebMCTS.
WebSynthesis employs a two-stage curriculum learning approach, including UI fundamental understanding and behavior cloning, to train the Policy Agent on synthesized trajectories.

Hijacking JARVIS: Benchmarking Mobile GUI Agents against Unprivileged Third Parties

AgentHazard: introduces, a scalable attack simulation framework for benchmarking mobile GUI agents against unprivileged third parties, with GUI hijacking tool (Modifies UI state/screenshot) and Attack module (Intercepts agent requests), where the framework simulates real-world misleading content attacks by injecting adversarial content into Android applications.
The framework utilizes a GUI hijacking tool as a native Android application to monitor and modify system UI state and screenshots in real-time, and an attack module to intercept agent requests for UI state and return the modified information.
This systematic investigation reveals that mobile GUI agents are vulnerable to misleading third-party content, highlighting the need for improved robustness and security mechanisms in agent design and training.

BYOKG-RAG: Multi-Strategy Graph Retrieval for Knowledge Graph Question Answering

BYOKG-RAG (Bring-Your-Own-KG RAG): introduces a framework that leverages an LLM (KG-Linker) to generate graph artifacts, employs specialized Graph Retrievers to fetch context, iteratively refines context (Refinement), and uses an LLM for final answer generation.
The framework addresses challenges in KGQA over custom KGs through multi-strategy graph linking and retrieval.
The iterative refinement process progressively improves retrieved context for more accurate artifact generation and final answers.

5th July 2025

Enhancing Robustness of LLM-Driven Multi-Agent Systems through Randomized Smoothing

Randomized Smoothing Framework: introduces a defense framework for LLM-driven Multi-Agent Systems, including MAS, LLM-driven Agent, LLM Function, Randomized Smoothing, Adaptive Sampling Strategy, Monte Carlo Sampling, Trim-mean, State, Reliable State Estimate, and Variance Estimation, to enhance safety and robustness against adversarial inputs and hallucinations in safety-critical domains.
The framework applies randomized smoothing at two levels: verifying neighbor reports and smoothing the agent's own LLM output using adaptive sampling and Monte Carlo methods.
This approach provides probabilistic safety guarantees and effectively mitigates misinformation propagation while maintaining consensus performance.

Ready Jurist One: Benchmarking Language Agents for Legal Intelligence in Dynamic Environments

JI-ENVS: introduces an interactive and dynamic legal environment for LLM-based agents, constructed via Real-world Legal Source, Role Agent Setting, and Multi-level Environment Construction components.
The framework includes JI-EVAL, a fine-grained evaluation framework utilizing Evaluation Metrics to assess agent performance and procedural compliance.
JI-ENVS comprises six representative legal scenarios categorized into three complexity levels, simulating real-world legal practice for benchmarking language agents.

How to Train Your LLM Web Agent: A Statistical Diagnosis

SFT+RL Pipeline: introduces a two-stage training approach for LLM web agents, combining SFT (Imitate expert policy) and RL (On-policy fine-tuning) using GRPO (RL optimization algorithm), training a Student Model (Trained web agent) on data generated by a Teacher Model (Generates expert data), and incorporating techniques like Curriculum Learning (Prioritizes challenging tasks) and Error Log Feedback (Agent receives error messages).
The pipeline utilizes specific GRPO techniques including Zero-advantage filtering (Drops zero advantage tokens), Standard-deviation normalized advantage (Normalizes advantage function), Importance Ratio (Weighting in GRPO), and Trust Region (Stabilizes GRPO training).
The research statistically diagnoses the compute allocation and hyperparameter sensitivity of this pipeline across different training stages and techniques.

CortexDebate: Debating Sparsely and Equally for Multi-Agent Debate

CortexDebate: introduces a multi-agent debate framework that establishes a Sparse Debating Graph (Communication structure) among LLM Agents (Participants), dynamically optimized by the McKinsey-based Debate Matter (MDM) (Graph optimizer) using the McKinsey Trust Formula (Weight calculation) across Initial Answer Generation (First response), Multi-round Debate (Iterative discussion), and Final Answer Generation (Aggregate result) via Majority Voting (Final decision method).
The framework addresses lengthy input contexts by establishing a sparse graph and mitigates the overconfidence dilemma by using the MDM module for credible evaluation.
The sparse graph reduces the context input burden for agents, while the MDM module promotes equal and effective debate among participants.

Agent Exchange: Shaping the Future of AI Agent Economics

AEX (Agent Exchange): introduces a specialized auction platform for AI agent economics, featuring User Side Platform (USP) (User interface, task translation), Agent Side Platform (ASP) (Capability, performance tracking), Agent Hub (Agent coordination, auction participation), and Data Management Platform (DMP) (Data sharing, value attribution), with AEX acting as the central auction engine (Central auction engine, resource allocation).
AEX facilitates autonomous agent coordination and economic participation within an agent-centric marketplace.
The platform supports dynamic capability assessment, collaborative value attribution, and autonomous team coordination.

A LLM-Driven Multi-Agent System for Professional Development of Mathematics Teachers

I-VIP (Intelligent Virtual Interactive Program): introduces an LLM-driven multi-agent system for mathematics teacher professional development, with Front-end Graphic UI, Back-end API Services, Administrative APIs, LLMs Generation APIs, Database APIs, User Interfaces, Progress Page, Learning Page, Diagnosis Page, Interactive Tools, Multi-Agent Framework (Filter, Judge(s), Responder(s), Facilitator), and Database.
The system provides a dialogue-based platform integrating structured educational content, interactive tools, and dynamic response generation using LLMs and a multi-agent framework.
I-VIP leverages multiple LLM-agents to enhance the accuracy of knowledge judgment and response generation for effective PD tutoring.

Exploring a Gamified Personality Assessment Method through Interaction with Multi-Personality LLM Agents

Multi-PR GPA: introduces a framework for gamified personality assessment using multi-personality LLM agents, including Gamified Interaction, LLM Agents, Multi-type Perception, and Personality Assessment components.
The approach aims for effective and imperceptible personality assessment by leveraging multiplicity and interactivity through engaging user interactions with LLM agents exhibiting diverse personalities.
The framework utilizes LLMs to simulate agents and analyze multi-type textual data (text, behavior, emotion, fine-grained traits) for personality evaluation via direct and questionnaire-based methods.

FinTeam: A Multi-Agent Collaborative Intelligence System for Comprehensive Financial Scenarios

FinTeam: introduces a multi-agent collaborative intelligence system with document analyzer, analyst, accountant, and consultant agents, designed to handle complex financial tasks across various scenarios.
The system leverages a knowledge base and external tools to support agent functions like text processing, data analysis, and numerical calculations.
FinTeam's agents collaborate following specific workflows tailored for macroeconomic, industry, and company analysis scenarios.

4th July 2025

Leveraging Large Language Models for Tacit Knowledge Discovery in Organizational Contexts

Agent-based framework: introduces an agent-based framework leveraging LLMs to iteratively reconstruct dataset descriptions through interactions with simulated employees, including LLM-based Agent, Simulated LLM Employees, Conversation Loop, Simulated Organizational Environment, MDP-inspired Decision Model, and Prompting Techniques.
The framework models knowledge dissemination using a Susceptible-Infectious process within synthetic company structures comprising hierarchy and relationship networks.
Simulations demonstrate the agent's ability to achieve high knowledge recall and navigate organizational complexity without needing direct access to a single domain specialist.

Less is More: Empowering GUI Agent with Context-Aware Simplification

SimpAgent (context-aware simplification framework): introduces a context-aware simplification framework with MLLM, Element Pruning, and Consistency-guided History Compression components, designed for efficient and effective GUI navigation.
The framework addresses challenges of high element density and history redundancy through masking-based pruning and consistency-guided compression.
SimpAgent achieves superior performance and reduces computational cost by simplifying element and history contexts.

AGENT-BASED DETECTION AND RESOLUTION OF INCOMPLETENESS AND AMBIGUITY IN INTERACTIONS WITH LARGE LANGUAGE MODELS

Agent-Based Question-Transducer: introduces an architecture for LLM-based QA systems that includes a Human, Context, Question-Transducer with LLM-based Agents using a Transducer-LLM, and a Responder-LLM.
The Question-Transducer processes user questions and context via LLM-based agents to classify and resolve potential incompleteness or ambiguity before forwarding to the Responder-LLM.
This agent-based approach aims to improve answer quality and shorten interactions by automatically handling question deficiencies.

Can LLMs Play Ô Ăn Quan Game? A Study of Multi-Step Planning and Decision Making

LLM Agent Framework: introduces an approach for evaluating large language models in the Ô Ăn Quan board game, utilizing G (Current game state), H (Reasoning history), R (Rule instructions), P (Agent persona) as inputs to a LLAMA (LLaMA-based model) which outputs Reason (Natural language rationale) and Action (Selected move) to drive the game via Update (Board state update).
The framework models different agent types through personas and assesses LLM performance in strategic planning and decision-making within a dynamic, rule-constrained environment.
Experiments with Llama models of varying sizes reveal that larger models exhibit deeper planning capabilities and a preference for long-term strategy, although smaller models can achieve competitive win rates.

STRUCTSENSE: A TASK-AGNOSTIC AGENTIC FRAMEWORK FOR STRUCTURED INFORMATION EXTRACTION WITH HUMAN-IN-THE-LOOP EVALUATION AND BENCHMARKING

STRUCTSENSE: introduces a task-agnostic agentic framework for structured information extraction, with Extractor Agent (Performs extraction task), Alignment Agent (Performs concept alignment), Judge Agent (Evaluates extraction and alignment), Feedback Agent (Incorporates human feedback), Ontology Database (Stores domain knowledge), and Memory (Retains execution context) components.
The framework integrates LLMs with domain-specific knowledge via ontologies and incorporates agentic capabilities and human-in-the-loop mechanisms.
STRUCTSENSE aims to address limitations of domain sensitivity and cross-task generalizability in structured information extraction.

Recon, Answer, Verify: Agents in Search of Truth

RAV (Recon-Answer-Verify): introduces an agentic framework for fact verification that iteratively decomposes claims into sub-questions using a Question Generator agent, answers them with an Answer Generator agent using evidence, and predicts a final label with a Label Generator agent based on the claim and question-answer history.
The pipeline utilizes a History component to store generated question-answer pairs, enabling iterative reasoning and complex claim verification.
RAV generalizes across domains and label granularities by breaking down fact verification into a question-answering process.

Is It Time To Treat Prompts As Code? A Multi-Use Case Study For Prompt Optimization Using DSPy

DSPy (Declarative Self-improving Python): introduces a programming model for prompt optimization, with Programs/Modules/Signatures (abstraction for prompt logic), Optimizers (algorithms to refine prompts), LLMs (language models used), Datasets (data for training/evaluation), Evaluation Metrics (measure performance), and Prompts (instructions and examples), aiming to treat prompts as code.
The framework uses optimizers like BootstrapFewShotWithRandomSearch and MIPROv2 to systematically refine prompt instructions and few-shot examples based on performance metrics evaluated on datasets.
Case studies demonstrate DSPy's ability to improve LLM performance across tasks like jailbreak detection, hallucination detection, code generation, routing agents, and prompt evaluation by optimizing prompts programmatically.

EvoAgentX: An Automated Framework for Evolving Agentic Workflows

EvoAgentX (EAX): introduces an open-source platform for automating multi-agent workflows, featuring Basic Components, Agent, Workflow, Evolving, and Evaluation layers.
The Evolving layer integrates TextGrad, AFlow, and MIPRO algorithms to iteratively refine agent prompts, tool configurations, and workflow topologies for dynamic optimization.
The framework includes built-in benchmarks and evaluation metrics to support this evolutionary process, demonstrating significant performance improvements across diverse tasks.

Reinforcement Learning-based Feature Generation Algorithm for Scientific Data

MAFG (Multi-agent Feature Generation): introduces a framework for automated feature generation using Multi-agent Collaboration (Agents select features, operations), Agent_C1 (Selects initial feature subset), Agent_C2 (Selects auxiliary feature), Agent_Op (Selects transformation operator), Feature Clustering (Groups similar features), Exploration (Agents construct transformations iteratively), Generate Features (Combines features and operations), Evaluation (Assesses generated features), Reward Evaluation (Calculates reward signal), Downstream ML Task Evaluation (Evaluates ML task performance), Feature Importance Evaluation (Evaluates feature importance), Mutual Information (Measures feature-target relationship), Performance Improvement (Measures ML performance gain), Dimension Control (Selects top-K features), Memory Replay (Stores experience tuples), Training (Updates agent strategies), Interpretable Optimization (Interprets key features), Interpretation (Explains generated features), Discussion (Part of interpretation), Large Language Model (LLM) (Interprets generated features), and Key Feature Combination (Important generated features), which models feature generation as a multi-agent reinforcement learning process with LLM-based interpretation.
The framework employs multiple agents collaborating through reinforcement learning to explore and optimize feature combinations, incorporating feature clustering and dimension control for efficiency.
An integrated Large Language Model provides interpretative evaluation of generated features, enhancing the scientific validity and practicality of the results.

AI-VAXGUIDE: AN AGENTIC RAG-BASED LLM FOR VACCINATION DECISIONS

AI-VaxGuide (Agentic RAG): introduces, "an intelligent, multilingual question-answering system for vaccination decisions", with Agentic Layer, Tools, RAG Pipeline, Data Preprocessing, Embedding and Storage, Hybrid Multi-Retriever, LLM, Mobile Application, Source Citation, and Feedback Mechanism components, where "the system transforms static vaccination guidelines into an interactive knowledge base using agent-based reasoning and retrieval-augmented generation".
The system employs a hybrid multi-retriever strategy and LLM-powered query expansion to enhance retrieval accuracy and handles complex queries through an agentic layer that orchestrates tasks and reasoning.
Deployed via a mobile application, AI-VaxGuide provides healthcare professionals with reliable, context-aware responses grounded in authoritative medical documents, including source citations for verification.

REAL: Benchmarking Abilities of Large Language Models for Housing Transactions and Services

REAL (Real Estate Agent Large Language Model Evaluation): introduces a benchmark for LLMs in housing transactions, including Data Collection, Data Classification, Data Manipulation, Memory Topic, Comprehension Topic, Reasoning Topic, and Hallucination Topic components.
The benchmark evaluates LLM abilities across four topics: memory, comprehension, reasoning, and hallucination, using 5,316 high-quality evaluation entries.
A data pipeline is designed for constructing the benchmark, involving collecting, classifying, and manipulating real estate data.

ElliottAgents: A Natural Language-Driven Multi-Agent System for Stock Market Analysis and Prediction

ElliottAgents: introduces a multi-agent system for stock market analysis and prediction, leveraging LLMs, RAG, DRL, Memory, and Dynamic Context within agents like Coordinator, Data Engineer, Elliott Waves Analyst, Backtester, Technical Analysis Expert, Investment Advisor, and Reports Writer to analyze data and generate human-comprehensible predictions.
The system combines AI-driven analysis with the Elliott Wave Principle, using natural language dialogue between agents for collaborative analysis and refinement.
Experimental validation demonstrates effectiveness in pattern recognition and generating interpretable market trend descriptions and forecasts.

Disambiguation-Centric Finetuning Makes Enterprise Tool-Calling LLMs More Realistic and Less Risky

DIAFORGE (Dialogue Framework for Organic Response Generation & Evaluation): introduces a three-stage pipeline for training and evaluating tool-calling LLMs, featuring UTC-GEN (synthetic data engine) with metadata generation and multi-agent dialogue synthesis and validation, supervised fine-tuning, and dynamic evaluation using a multi-sampling user-proxy agent and the Assistant LLM (model being evaluated).
The framework utilizes the UTC-GEN engine with components like User Proxy Agent (simulates user) and Assistant Agent (simulates assistant) to generate disambiguation-focused multi-turn dialogues.
Generated dialogues are validated by a Multi-Agent Dialogue Validator before being used for supervised fine-tuning and dynamic evaluation, which employs Generator LLM (generates user utterances) and Voter LLM (selects best user utterance) for robust user simulation.

GRAFT: A Graph-based Flow-aware Agentic Framework for Document-level Machine Translation

GRAFT (Graph-Augmented Agentic Framework for Document-Level Translation): introduces a novel graph-based document-level machine translation system that leverages Large Language Model agents, including a Discourse Agent (segments document), Directed Acyclic Graph (DAG) (intermediate document representation), Edge Agent (establishes dependencies), Memory Agent (extracts local memory), and Translation Agent (translates discourse units).
The framework transforms a source document into a DAG of discourse units to model dependencies and propagate context for coherent translation.
GRAFT's agentic architecture explicitly models and propagates intra- and inter-discourse context, achieving significant performance gains over state-of-the-art systems.

LTLCRIT: A TEMPORAL LOGIC-BASED LLM CRITIC FOR SAFE AND EFFICIENT EMBODIED AGENTS

LTLCrit: introduces a modular actor-critic architecture with Environment, Full State, Abstract State, LLM Actor, Verifier, Low level Planner, Memory Buffer, and LLM Critic components, designed for safe and efficient embodied agents using temporal logic constraints.
The system operates with an online actor loop for real-time action selection and an offline critic loop for learning and refining temporal logic constraints from trajectories.
The LLM Actor proposes actions, the Verifier checks them against LTL constraints, and the LLM Critic generates new constraints based on observed behavior stored in the Memory Buffer.

Conformal Information Pursuit for Interactively Guiding Large Language Models

C-IP (Conformal Information Pursuit): introduces a sequential information pursuit algorithm for interactive question answering using LLMs with LLM-based predictor, Calibration dataset, Prediction sets, Uncertainty estimation, Query selection, History sampling (Uniform), History sampling (LLM Simulation), Querier LLM (20Q), Answerer LLM (20Q), Expert LLM (MediQ), and Patient LLM (MediQ) components.
The approach leverages conformal prediction sets to estimate uncertainty, guiding query selection to minimize prediction set size.
Evaluated on 20 Questions and MediQ datasets, C-IP shows competitive predictive performance and shorter query chains compared to baselines.

GDGB: A Benchmark for Generative Dynamic Text-Attributed Graph Learning

GAG-General: introduces an LLM-based multi-agent framework for dynamic text-attributed graph generation, including LLM-based agents (perform selection/interaction), node memory module (records interactions), memory reflection mechanism (summarizes memories), and node generator agents (generate new nodes).
The framework supports two tasks, Transductive Dynamic Graph Generation (TDGG) and Inductive Dynamic Graph Generation (IDGG), the latter incorporating new node generation.
It leverages LLMs for text understanding and generation, integrating structural, temporal, and textual information via node memories for robust DyTAG generation.

CodeAgents: A Token-Efficient Framework for Codified Multi-Agent Reasoning in LLMs

CodeAgents: introduces a token-efficient framework for codified multi-agent reasoning in LLMs, featuring Planner, ToolCaller, and Replanner agents interacting via Codified Pseudocode within an Execution Environment using Tools, processing Observation and Error feedback.
The framework codifies task, plan, feedback, system roles, and tool invocations into modular pseudocode with Typed Variables, Control Flow Structures, Precondition Assertions, and Reusable Subroutines for structured, interpretable, and robust reasoning.
Evaluated on GAIA, HotpotQA, and VirtualHome, CodeAgents consistently improves planning performance and significantly reduces token usage compared to natural language baselines.

TOWARDS MACHINE THEORY OF MIND WITH LARGE LANGUAGE MODEL-AUGMENTED INVERSE PLANNING

LAIP (LLM-AUGMENTED INVERSE PLANNING): introduces a hybrid approach for machine Theory of Mind, combining an LLM Interface, Hypothesis Generator, Action Likelihood Generator, Action Observer, Posterior Calculator, and Belief State Updater interacting with a Task Environment to infer agent mental states.
The model leverages LLMs to generate hypotheses and action likelihoods in potentially open-ended spaces, while using inverse planning to compute posterior probabilities based on observed actions.
This architecture aims to improve robustness and performance on Theory of Mind tasks compared to using LLMs or traditional Bayesian models alone, particularly benefiting smaller LLMs.

Mirror in the Model: Ad Banner Image Generation via Reflective Multi-LLM and Multi-modal Agents

MIMO (Mirror In-the-Model): introduces an agentic refinement framework for automatic ad banner generation, combining a hierarchical multi-modal agent system (MIMO-Core) with a coordination loop (MIMO-Loop) for iterative design improvement and stylistic exploration.
The MIMO-Core uses LLM-based agents for content creation, evaluation, and revision, operating on a visual draft and shared memory, while the MIMO-Loop generates diverse styles, runs parallel core instances, and uses multi-agent judging for selection and refinement.
The framework leverages multimodal tools for image generation, visual input, and structured feedback, mimicking human design team workflows to produce high-quality ad banners.

Towards a Playground to Democratize Experimentation and Benchmarking of AI Agents for Network Troubleshooting

Framework: introduces a modular and extensible benchmarking platform for evaluating AI agents in network troubleshooting, comprising User, AI Agents, Tools (Data adapters, Actions), Evaluator, Orchestrator, Environment (Emulator, Telemetry collector, Network scenarios), Traffic generator, and Chaos Engineering components.
The platform standardizes experimentation by allowing users to plug in custom AI agents and evaluate them on curated network problem sets using emulated environments and automated workflows.
It supports interactive, closed-loop operations where LLM agents can dynamically adapt strategies based on real-time telemetry and network state.

3rd July 2025

SI-Agent: An Agentic Framework for Feedback-Driven Generation and Tuning of Human-Readable System Instructions for Large Language Models

SI-Agent: introduces an agentic framework for automated generation and tuning of human-readable System Instructions (SIs) for LLMs, with Instructor Agent (generates/refines SIs), Instruction Follower Agent (executes task using SI), and Feedback/Reward Agent (evaluates output and SI).
The framework operates through an iterative feedback loop where the Feedback/Reward Agent's signal guides the Instructor Agent's refinement process.
This approach aims to balance task effectiveness and SI interpretability, addressing limitations of manual and non-readable automated methods.

Reinforcement Learning with Verifiable Emotion Rewards for Empathetic Agents

RLVER (Reinforcement Learning with Verifiable Emotion Rewards): introduces an end-to-end reinforcement learning framework for empathetic agents, including an Agent (LLM being trained), a User Simulator (SAGE) (LLM-based environment) providing Verifiable Emotion Reward (deterministic emotion score), a Policy Optimization Algorithm (PPO/GRPO) (RL algorithm) for Policy Update (policy update mechanism), and a Think-Then-Say Scaffold (reasoning prompting template).
The framework leverages verifiable emotion rewards from simulated users to train LLMs for higher-order empathetic abilities.
RLVER demonstrates that emotionally intelligent behaviors can be effectively acquired through RL training with a self-consistent user simulator and principled training strategies.

Moral Responsibility or Obedience: What Do We Want from AI?

Agentic AI: introduces, with Goal-Oriented Autonomy, Persistent Identity, Autonomous Adaptability, Dynamic/Context-Aware Interaction, Broad/Continual Learning, Collaborative Reasoning, Autonomous/Contextual Reasoning, Independent Initiative, and Moral Reasoning/Ethical Judgment, a discussion on shifting AI safety evaluation from obedience to ethical judgment for systems capable of navigating moral dilemmas.
The paper argues that recent incidents of AI "disobedience" in safety testing should be viewed as evidence of emerging ethical reasoning rather than misalignment or failure.
Evaluating agentic AI safety requires frameworks that assess ethical judgment and the capacity to resolve moral dilemmas, similar to expectations for human professionals.

KERAP: A Knowledge-Enhanced Reasoning Approach for Accurate Zero-shot Diagnosis Prediction Using Multi-agent LLMs

KERAP (A Knowledge-Enhanced Reasoning Approach): introduces a knowledge graph-enhanced reasoning approach for zero-shot diagnosis prediction, with linkage-, retrieval-, and prediction-agents.
The framework utilizes a linkage agent to map EHR data to a biomedical knowledge graph, a retrieval agent to extract relevant knowledge, and a prediction agent for multi-stage reasoning.
KERAP integrates patient data and structured knowledge via multi-agent collaboration and iterative reasoning to enhance diagnostic accuracy and reliability.

Knowledge Protocol Engineering: A New Paradigm for AI in Domain-Specific Knowledge Work

KPE (Knowledge Protocol Engineering): introduces a new paradigm for AI specialization by translating human expert knowledge into a machine-executable Knowledge Protocol (KP) to guide a Large Language Model (LLM).
The Knowledge Protocol (KP) contains domain-specific methodology, workflows, and strategies, enabling the LLM to perform complex, multi-step tasks requiring procedural reasoning.
KPE elevates the human expert to a Knowledge Architect role, authoring the protocol that augments the LLM's reasoning architecture beyond factual retrieval.

Meta SecAlign: A Secure Foundation LLM Against Prompt Injection Attacks

META SECALIGN: introduces an open-source LLM with built-in model-level defense against prompt injection attacks, utilizing the SecAlign++ training recipe, a modified chat template, a preference dataset, Direct Preference Optimization, and LoRA fine-tuning with a tunable LoRA alpha.
The SecAlign++ recipe fine-tunes a Base Instruct LLM using a preference dataset constructed with randomized injection positions and self-generated responses, optimized via DPO and LoRA.
The modified chat template introduces a dedicated input role to separate untrusted data, enabling the model to prioritize trusted instructions and control the utility-security trade-off via LoRA alpha.

BOURBAKI: SELF-GENERATED AND GOAL-CONDITIONED MDPS FOR THEOREM PROVING

Bourbaki: introduces self-generated goal-conditioned MDPs (sG-MDPs), solved using Monte Carlo Tree Search (MCTS) with a Policy Model (LLMs) and Value Function, interacting with the Lean 4 environment via Pantograph and guided by a Reward Function, to tackle automated theorem proving.
The sG-MDP framework allows agents to dynamically generate and pursue subgoals based on the evolving proof state, providing a denser reward signal than traditional sparse theorem proving by defining State Space, Action Space, and Goal Space.
The system ensembles multiple LLMs for subgoal generation and tactic synthesis, achieving state-of-the-art results on the PutnamBench benchmark by enhancing proof search efficiency and effectiveness.

Control at Stake: Evaluating the Security Landscape of LLM-Driven Email Agents

EAHawk (automated pipeline): introduces EAHawk, with Email Agent Identification (identifies email agents), Attack Prompt Generation (generates attack prompts), Email Agent Hijacking Confirmation (confirms successful hijacking), Test Environment (simulates attack scenario), Automatic Attack Launching (sends attack prompts), and Oracle Definition (detects successful hijacking), as an automated pipeline to evaluate the Email Agent Hijacking (EAH) attack on LLM email agents.
The EAH attack overrides the original prompts of an email agent via external email resources, allowing attackers to gain remote control and perform malicious actions without user awareness.
EAHawk systematically assesses the practical impact of the EAH attack by identifying email agents, generating diverse attack prompts, and simulating attacks in a controlled environment to verify hijacking success.

On the Convergence of Large Language Model Optimizer for Black-Box Network Management

LLMO (Large Language Model Optimizer): introduces a framework for black-box network management using pretrained LLMs as optimization agents, including LLM L(·) (Optimization agent), Memory M(t) (Stores action-reward pairs), Sampling operator S(.) (Selects in-context examples), and Prompt generator P(·) (Creates LLM input).
The paper models the LLMO procedure as a finite-state Markov chain and proves its convergence to the global optimum, particularly with elitist sampling.
The analysis is extended to a multi-LLM architecture, demonstrating improved convergence speed with multiple LLMs.

Hey AI, Generate Me a Hardware Code! Agentic AI-based Hardware Design & Verification

Agentic AI methodology (Multi-Agent System-based): introduces an approach for hardware design and verification using Specialized AI Agents, managed by an Agent Orchestration System and Group Chat Manager, with an Executor Agent for tool interaction, a Critic Agent for feedback, Human-in-the-Loop intervention, and Shared Context for communication.
The methodology structures the process into planning, development, and execution phases, enabling iterative refinement and self-correction through agent collaboration.
Integration with industry-standard EDA tools and targeted human intervention addresses limitations of zero-shot LLM approaches for reliable design and verification.

VRAgent-R1: Boosting Video Recommendation with MLLM-based Agents via Reinforcement Learning

VRAgent-R1: introduces a novel agent-based paradigm for video recommendation, incorporating an Item Perception (IP) Agent for video modeling and a User Simulation (US) Agent for user modeling, interacting within a Recommendation System Environment.
The IP Agent utilizes Key Frame Retrieval, Collaborative Multimodal Perception, and Recommendation Relevant Analysis to generate Enhanced Video Features from Historical Videos.
The US Agent simulates user behavior using Chain-of-Thought Reasoning on user status and candidate videos, trained via Reinforcement Fine-Tuning with GRPO based on Task-Specific Rewards derived from Ground Truth.

STRATEGIC INTELLIGENCE IN LARGE LANGUAGE MODELS EVIDENCE FROM EVOLUTIONARY GAME THEORY.

Evolutionary IPD Tournament Framework: introduces a system to evaluate LLMs' strategic intelligence by pitting LLM Agents (OpenAI, Gemini, Anthropic) and Classic Strategies (Benchmark IPD players) against each other in a Tournament Simulation (Orchestrates evolutionary dynamics) governed by a Match Procedure (Defines game rules) and an Evolutionary Update Rule (Determines population changes), with performance analyzed using Key Metrics (Quantify agent performance) and Qualitative Content Analysis (Analyzes LLM rationales), supported by Implementation & Reproducibility (Software and data).
The framework simulates iterated Prisoner's Dilemma tournaments across various conditions, including different termination probabilities and mutation, to observe agent behavior and evolutionary success.
Analysis of agent performance, strategic fingerprints, and textual rationales provides evidence that LLMs exhibit distinct, adaptive strategic reasoning rather than merely retrieving memorized patterns.

DynamiCare: A Dynamic Multi-Agent Framework for Interactive and Open-Ended Medical Decision-Making

DynamiCare: introduces a dynamic multi-agent framework for medical decision-making, comprising a Patient System (Responds to queries) and a Doctor System (Manages diagnostic process).
The Doctor System includes a Central Agent (Manages specialist team) that dynamically adjusts the Specialist Team (Generates diagnosis/questions) based on the Visit Log (Records interaction history).
The Patient System processes queries using components like Paraphrase, Match, Fallback, Tokenize, and Keywords map to generate responses from patient data.

WebSailor: Navigating Super-human Reasoning for Web Agent

WebSailor: introduces a complete post-training methodology for web agents, including training data synthesis (SailorFog-QA), trajectory reconstruction, rejection sampling fine-tuning (RFT), duplicating sampling policy optimization (DUPO), agent architecture (ReAct framework), and tools (search tool, visit tool, summary model), designed to instill sophisticated reasoning for complex web navigation.
The approach generates high-uncertainty training data (SailorFog-QA) and reconstructs concise reasoning trajectories from expert models to overcome limitations of direct imitation and context overload.
The training methodology combines an RFT cold start with an efficient RL algorithm (DUPO) to enhance sample efficiency and performance on challenging information-seeking tasks, achieving performance comparable to proprietary agents.

Are You Listening to Me? Fine-Tuning Chatbots for Empathetic Dialogue

Fine-Tuning Chatbots for Empathetic Dialogue: introduces an approach to evaluate LLMs for empathetic dialogue using an Expert-Curated Dataset (Base empathetic conversations), LLMs (Generate/extend dialogue), Prompt Engineering (Guide LLM behavior), VADER Tool (Quantify emotional energy), and Expert Evaluator (Assess empathy quality).
The approach involves creating baseline empathetic conversations, using prompt engineering to guide LLMs (ChatGPT and Gemini) to extend or generate similar dialogues, and evaluating the results via automated sentiment analysis and human expert assessment.
This methodology highlights the importance of combining quantitative lexical analysis with qualitative human evaluation to assess the nuanced quality of empathetic listening in LLM-generated conversations.

CyberRAG: An agentic RAG cyber attack classification and reporting tool

CyberRAG: introduces a modular, agent-based RAG framework for cyber-attack classification and reporting, including a Core LLM Engine, Classification Tool, RAG Tool, Attack Description Report Generator, and Interactive Chat.
The framework uses specialized LLM classifiers and iterative retrieval-and-reasoning to classify payloads and generate context-aware explanations.
CyberRAG provides interpretable, SOC-ready reports and supports interactive user dialogue for enhanced analysis and understanding.

OMS: On-the-fly, Multi-Objective, Self-Reflective Ad Keyword Generation via LLM Agent

OMS (On-the-fly, Multi-Objective, Self-Reflective Ad Keyword Generation via LLM Agent): introduces a framework for ad keyword generation featuring a Keyword Performance Monitor (Monitors keyword performance), Agentic Clustering-Ranking Module (Analyzes, scores, ranks keywords), Multi-Turn Generation-Reflection Module (Generates, refines keywords), various Tools (Support generation/reflection), and Keyword Deployment (Deploys new keywords).
The framework monitors keyword performance, analyzes intent, calculates multi-objective scores, ranks keywords, generates and refines new keywords using reflection, and re-clusters them.
It operates on-the-fly without training data, optimizes for multiple metrics, and leverages LLM agents and external tools for adaptive generation.

MemAgent: Reshaping Long-Context LLM with Multi-Conv RL-based Memory Agent

MEMAGENT: introduces a novel agent workflow for long-context LLMs, featuring a base language model, fixed-length token memory, a context processing module for iterative updates, an answer generation module, trained using the Multi-conv DAPO RL algorithm with a rule-based verifier for rewards.
The approach processes long documents in segments, updating memory via an overwrite strategy to achieve linear time complexity and handle arbitrary input lengths.
Reinforcement learning trains the model to selectively retain answer-critical information in memory, enabling strong extrapolation capabilities on long-context tasks.

Establishing Best Practices for Building Rigorous Agentic Benchmarks

ABC (Agentic Benchmark Checklist): introduces a set of guidelines for evaluating agentic benchmarks, with components assessing task validity, outcome validity, and benchmark reporting.
The checklist identifies issues in benchmark design and implementation that can lead to inaccurate performance estimations of AI agents.
Applying the checklist helps improve the rigor of agentic benchmark evaluation and reporting practices.

Meta SecAlign: A Secure Foundation LLM Against Prompt Injection Attacks

META SECALIGN: introduces, "a secure foundation LLM against prompt injection attacks", with Base Instruct LLM (underlying language model), Modified Chat Template (structured input format), SecAlign++ Recipe (fine-tuning process), and LoRA (parameter-efficient tuning method), where "it develops the first open-source LLM with built-in model-level defense achieving commercial-grade performance".
The framework fine-tunes LLAMA 3 series Instruct LLMs using a modified chat template and the SecAlign++ recipe, which includes DPO and LoRA.
Evaluations show META SECALIGN achieves state-of-the-art security against prompt injection attacks with comparable utility to closed-source models.

Reinforcement Learning with Verifiable Emotion Rewards for Empathetic Agents

RLVER (Reinforcement Learning with Verifiable Emotion Rewards): introduces a reinforcement learning framework for training LLMs, including a user simulator, an LLM agent, emotion rewards, policy optimization, and an optional thinking scaffold.
The framework leverages a self-consistent user simulator to generate verifiable emotion rewards, guiding the LLM agent's learning towards empathetic dialogue.
An explicit thinking scaffold can be incorporated into the LLM's generation process to enhance the development of higher-order empathetic strategies.

Autonomous Control Leveraging LLMs: An Agentic Framework for Next-Generation Industrial Automation

Agentic Framework: introduces a unified agentic framework leveraging LLMs for autonomous industrial control, including a Monitoring Agent (Continuously ingests sensor data), Action Agent (Proposes control moves or plans), Digital Twin Agent (Simulates proposed actions), Validation Agent (Scrutinises simulated outcome), Reprompting Agent (Interprets feedback and refines prompt), and Safety System (Provides fallback control).
The framework integrates symbolic planning via Finite State Machines and continuous control using an iterative action-simulation-validation-reprompting loop.
Case studies demonstrate the framework's ability to generate valid recovery paths in FSMs and regulate temperature in a physical system under disturbances, highlighting the role of validation and reprompting in achieving robustness.

2nd July 2025

Do Role-Playing Agents Practice What They Preach? Belief-Behavior Consistency in LLM-Based Simulations of Human Trust

LLM-based Role-Playing Agent System: investigates belief-behavior consistency in LLM-based role-playing agents, with LLM Agent (role-playing model), Persona (synthetic profile attributes), Trust Game Environment (simulated economic game), Trustee Archetypes (fixed opponent strategies), Prompting Strategies (agent interaction methods), and ReAct Framework (reasoning and acting process), by evaluating consistency between elicited beliefs and simulated behavior.
The study uses the Trust Game as a testbed and evaluates consistency at both population and individual levels using various elicitation and conditioning strategies.
Findings reveal systematic inconsistencies between stated beliefs and simulated behaviors, highlighting the need for robust internal consistency evaluation before using these systems in behavioral studies.

Enhancing COBOL Code Explanations: A Multi-Agents Approach Using Large Language Models

Multi-Agents Approach: introduces a multi-agent framework for generating COBOL code explanations, with Code Processing Agent (Analyzes code, generates explanations), Text Processing Agent (Refines, merges explanations), Function Level (Function explanation pipeline), File Level (File explanation pipeline), and Project Level (Project explanation pipeline) components.
The approach leverages two LLM-based agents and source code artifacts to generate explanations at function, file, and project granularities.
Hierarchical merging is employed within the File Level and Project Level pipelines to handle long code exceeding LLM token limits.

Synergizing Logical Reasoning, Knowledge Management and Collaboration in Multi-Agent LLM System

SynergyMAS: introduces a multi-agent system framework integrating Logical Reasoning, Retrieval-Augmented Generation (RAG), and Theory of Mind (ToM) capabilities, supported by Communication Protocols, Agent Specialization, a Hierarchical Structure, and internal Agent Architecture, to enhance LLM performance in complex tasks.
The framework utilizes a Neo4j graph knowledge base and Clingo logic solver for reasoning, a modified Corrective RAG with Chroma vector base and web search for knowledge management, and explicit belief state modeling for Theory of Mind.
A hierarchical structure with a coordinating "boss" agent and specialized follower agents facilitates collaborative problem-solving through structured interactions and iterative development cycles.

The Future is Agentic: Definitions, Perspectives, and Open Challenges of Multi-Agent Recommender Systems

Multi-Agent System (MAS): introduces a unified formalism for agentic recommender systems, comprising LLM Agent (Core decision-maker), Memory (Stores state/context), Tools (External functions/APIs), Environment (Shared resources/percepts), Interaction Protocol (Agent communication rules), Chat Agent (User interface), Specialised-Agent Caller (Spawns sub-agents), Retrieval Agent (Fetches data/items), Consistency Agent (Ensures coherence/compliance), Ranking & Presentation Agent (Orders/formats output), User Simulator (Generates synthetic behavior), Evaluation Agent (Logs/computes metrics), Session Summariser (Compresses session outcomes), Reporter Agent (Aggregates/reports results), Image Agent (Extracts image features), and Explanation Agent (Generates justifications).
The framework enables LLM agents to plan, remember, use tools, and cooperate to handle complex, multi-step recommendation tasks beyond single-query responses.
Specific use cases like party planning, user simulation, multi-modal recommendation, and explanation generation illustrate how agentic orchestration unlocks new capabilities and addresses challenges in personalization, evaluation, and transparency.

Measuring Scientific Capabilities of Language Models with a Systems Biology Dry Lab

SCIGYM: introduces a benchmark evaluating LLMs' scientific discovery capabilities using a dry lab simulation of biological systems, featuring an Agent, Dry Lab, SBML Models, Python Execution Environment, Experimental Perturbations, Observations, and Model Submission.
The framework tasks the Agent with discovering missing biological mechanisms by interacting with the Dry Lab, which simulates SBML Models and provides Observations from Experimental Perturbations, allowing the Agent to analyze data using the Python Execution Environment and refine its hypothesis for Model Submission.
This dry lab approach overcomes the cost and time limitations of wet lab experiments, enabling scalable evaluation of LLMs on iterative experiment design and data analysis in complex biological systems.

Reasoning on a Budget: A Survey of Adaptive and Controllable Test-Time Compute in LLMs

Test-time Compute (TTC) strategies: introduces a two-tiered taxonomy of controllable (L1) and adaptive (L2) methods for improving LLM reasoning efficiency, categorized by sequential and parallel approaches, implemented via prompting, supervised finetuning, or reinforcement learning.
The survey addresses the inefficiency of current LLMs that use fixed inference compute, often overthinking simple problems and underthinking hard ones.
Benchmarking reveals systemic inefficiencies in existing models, highlighting the need for more adaptive and compute-aware reasoning mechanisms to balance performance, cost, and latency.

The Thin Line Between Comprehension and Persuasion in LLMs

LLM Debate Evaluation Framework: introduces a method to evaluate LLMs in debate scenarios, with LLM (Generation), Formal Dialogue Model (FDM), Human Participant, Debate Transcript, Human Annotator, Annotation Criteria, LLM (Evaluation), Automated Prompt Optimization (APO), Audience, Survey Response, and Speech-to-Text (STT) components, where the paper evaluates LLMs' persuasive abilities and comprehension in structured debates.
The framework compares standard LLMs with LLMs augmented by a Formal Dialogue Model (DE model) in debates against humans and other LLMs.
Evaluation involves human and LLM annotation of debate transcripts based on defined criteria, alongside participant and audience surveys on satisfaction and persuasion.

Decision-oriented Text Evaluation

Decision-Oriented Evaluation Framework: introduces, "a decision-oriented framework for evaluating generated text by directly measuring its influence on human and large language model (LLM) decision outcomes", with all Text Source (Origin of text), Text Generation Method (Process for creating text), Decision Agent (Entity making decisions), and Evaluation Metric (Measure of decision quality) components, where the framework evaluates generated text by assessing the accuracy of investment decisions made by human and LLM agents based on the text.
The framework utilizes market digests generated by human journalists or LLMs using different selection methods as input for human and LLM decision-making agents.
Decision quality is quantified using thresholded prediction accuracy of stock movements, highlighting the practical value of generated text beyond traditional intrinsic metrics.

Bridging UI Design and chatbot Interactions: Applying Form-Based Principles to Conversational Agents

GUI-Inspired CoT with Submit/Reset Metaphor: introduces a method for domain-specific chatbots using User Query, Session Data, Task-Based Prompt, LLM, LLM Response, Parser, Chain-of-Thought (CoT), Decision Logic, and Back-end System to model GUI actions like Submit/Reset.
The approach leverages LLMs prompted to generate structured data and CoT reasoning, which is parsed by the back-end to manage context and execute actions unambiguously.
By making acknowledgment and context switching explicit via structured LLM outputs and CoT, the system reduces user confusion and aligns conversational flow with back-end logic.

Bridging UI Design and chatbot Interactions: Applying Form-Based Principles to Conversational Agents

GUI-Inspired CoT with Submit/Reset Metaphor: introduces a method for domain-specific chatbots using User Query, Session Data, Task-Based Prompt, LLM, LLM Response, Parser, Chain-of-Thought (CoT), Decision Logic, and Back-end System to model GUI actions like Submit/Reset.
The approach leverages LLMs prompted to generate structured data and CoT reasoning, which is parsed by the back-end to manage context and execute actions unambiguously.
By making acknowledgment and context switching explicit via structured LLM outputs and CoT, the system reduces user confusion and aligns conversational flow with back-end logic.

Agent Ideate: A Framework for Product Idea Generation from Patents Using Agentic AI

Agent Ideate: introduces a framework for generating product ideas from patents, with Patent Summarizer Agent (Summarizes patent), Keyword Extraction and Search Agent (Extracts keywords and searches), and Idea Generation & Validation Agents (Generates and validates idea).
The framework processes Patent Data (Input source) through specialized agents to produce structured Product Information (Output).
The agentic approach leverages LLMs and external search tools to enhance the innovation pipeline from patent data.

Exploring Advanced LLM Multi-Agent Systems Based on Blackboard Architecture

bMAS (blackboard-based LLM multi-agent system): introduces a framework with a Blackboard (shared information space), Control Unit (selects agents), and Agent Group (collection of LLM agents), implemented in LbMAS with an Agent Generation Module (generates expert agents), Solution Extraction Module (extracts final solution), and LLM Set (pool of base models).
The framework utilizes a shared blackboard for agent communication and collaboration, replacing individual agent memory modules.
The Control Unit dynamically selects agents based on the blackboard content, enabling adaptive problem-solving without predefined workflows.

Data Agent: A Holistic Architecture for Orchestrating Data+AI Ecosystems

Data Agent: introduces a comprehensive architecture for orchestrating Data+AI ecosystems, including Data Plane (Organize, understand data), Engine Plane (Understand, schedule engines, agents), Orchestration Plane (Manage pipeline workflow), Memory (Store knowledge, context), Perception (Understand environment, tasks), Tools (External data processing utilities), and Continuous Learning (Improve agent over time).
The architecture integrates knowledge comprehension, reasoning, and planning capabilities to handle data-related tasks autonomously.
It addresses challenges in understanding data/queries/environments/tools, orchestrating/optimizing/executing pipelines, and enabling self-reflection for continuous improvement.

AGENT-AS-TOOL: A STUDY ON THE HIERARCHICAL DECISION MAKING WITH REINFORCEMENT LEARNING

Agent-as-tool: introduces a hierarchical framework with Planner (reasons, decides tool use), Toolcaller (executes tool actions, processes results), Tools (external interfaces), Observations (structured tool outputs), and Reinforcement Learning (GRPO) (fine-tunes Planner).
The framework decouples reasoning and tool execution by assigning these roles to the Planner and Toolcaller respectively.
This hierarchical design improves reasoning accuracy by providing the Planner with cleaner, structured observations from the Toolcaller.

BioMARS: A Multi-Agent Robotic System for Autonomous Biological Experiments

BioMARS (Biological Multi-Agent Robotic System): introduces a multi-agent robotic system for autonomous biological experiments, integrating LLMs, VLMs, and modular robotics with Biologist Agent (Designs protocols), Technician Agent (Translates to code), Inspector Agent (Detects errors), Physical Hardware (Executes actions), User Interface (Human interaction), LLMs (Language models), VLMs (Vision-language models), RAG (Retrieval augmented generation), Knowledge Checker (Filters content), Workflow Generator (Formulates steps), Workflow Checker (Refines workflow), Code Generator (Maps to pseudo-code), Code Checker (Validates code), Vision Transformer (Visual detection), and ROS (Robot control system) components.
The system employs a hierarchical architecture where the Biologist Agent designs protocols, the Technician Agent translates them into robotic code, and the Inspector Agent monitors execution for errors.
BioMARS leverages LLMs and VLMs for reasoning and perception, enabling autonomous protocol design, execution, and error handling in biological tasks.

Using multi-agent architecture to mitigate the risk of LLM hallucinations

Multi-agent architecture: introduces a system to handle customer SMS requests using multiple intelligent agents, including services for receiving messages, orchestrating processing, arbitrating decisions, and specialized agents for handling specific tasks.
The architecture integrates LLM-based agents with fuzzy logic and parsing techniques to interpret messages, evaluate confidence, assess customer importance, and detect potential LLM hallucinations.
Hallucination mitigation involves comparing keyword extraction results from parsing and LLM agents and using fuzzy rules to determine the handling of potentially high-risk requests or route messages to expert agents.

RALLY: Role-Adaptive LLM-Driven Yoked Navigation for Agentic UAV Swarms

RALLY (Role-Adaptive LLM-Driven Yoked Navigation): introduces, with LLM-based two-stage semantic reasoning module, Local intention generation, Neighborhood consensus refinement, Role-value Mixing Network (RMIX)-based credit-distribution mechanism, RMIX Network, Prior Offline Experience Replay Buffer, and Fine-tuned LLM components, a framework for role-adaptive LLM-driven yoked navigation for agentic UAV swarms.
The framework integrates LLM semantic reasoning with MARL policy learning for coordinating roles and decision-making across UAV swarms.
It employs a two-stage LLM process for consensus inference and a RMIX-based mechanism for dynamic role assignment and credit assignment.

Evaluating LLM Agent Collusion in Double Auctions

LLM Agent Double Auction Simulation: introduces a system to evaluate LLM agent collusion in a simulated continuous double auction environment with LLM Agents (buyers and sellers), Bid Queue, Ask Queue, Market Resolution Mechanism, Updated Market History, Planning & Messaging, Persistent Memory Store, Strategy Scratchpad, LLM Evaluator, Overseer Agent, CEO Message, and CME Group Regulators Message, investigating factors affecting seller collusion.
The research explores how communication, model variation, and environmental pressures like oversight and urgency influence LLM seller agents' propensity to collude and their pricing behavior.
Findings indicate that direct communication increases collusion, model choice affects coordination, and urgency can override the effects of regulatory oversight in promoting collusive pricing strategies.

AI Agents and Agentic AI-Navigating a Plethora of Concepts for Future Manufacturing

LLM-Agents, MLLM-Agents, and Agentic AI: reviews the evolution and concepts of AI agents, detailing LLM-Agents with Profile (identity, role, constraints), Memory (stores, retrieves interactions), Planning (decomposes tasks, steps), and Action (executes decisions, tools) components, MLLM-Agents, and Agentic AI, exploring their manufacturing potential.
The paper discusses how Generative AI, including LLMs and MLLMs, enhances AI agents' capabilities for manufacturing applications.
It highlights the progression from traditional AI agents to more autonomous, adaptive, and goal-driven Agentic AI systems for future manufacturing.

Context-Aware Code Wiring Recommendation with LLM-based Agent

WIRL: introduces an LLM-based agent for context-aware code wiring, combining an LLM (Large Language Model), an Agent Pilot (Orchestrates communication), and a Customized Toolkit (Provides essential functionalities) with Locator (Identifies unresolved elements), Collector (Collects contextual information), and Completer (Infills isolated code) tools.
The framework reformulates code wiring as a retrieval-augmented generation infilling task, leveraging LLMs' strengths in code completion.
WIRL employs a hybrid execution mode and a state machine to guide the agent's exploration and improve efficiency.

Emotionally Intelligent Task-oriented Dialogue Systems: Architecture, Representation, and Optimisation

LUSTER (LLM-based Unified System for Task-oriented dialogue with End-to-end Reinforcement learning): introduces an end-to-end task-oriented dialogue system integrating Dialogue History Encoding (Alternating user/system utterances), User Emotion Recognition (Predicts user emotional state), Active Domain Recognition (Identifies active domain), Dialogue State Tracking (Generates dialogue state), Database Query (Retrieves matching entries), Dialogue Action Prediction (Generates dialogue actions), System Conduct Selection (Selects system emotional stance), System Response Generation (Generates natural language response), LLM (Backbone model), and Database (Structured information storage) components.
The system uses fully lexicalised representations and is trained with both supervised learning and hierarchical reinforcement learning, incorporating short-term emotion and long-term task success rewards.
LUSTER achieves higher task success and lower concept error compared to other approaches by combining LLM capabilities with structured reward modeling.

GAIus: Combining Genai with Legal Clauses Retrieval for Knowledge-based Assistant

gAlus: introduces a cognitive LLM-based agent architecture for legal question answering using Polish Civil Code, featuring an AI assistant (single agent), Retriever (selects relevant documents), Documents database (stores legal text articles), Document chunking (splits legal text into articles), Query reformulation (generalizes user query), and Document scoring function (custom text matching).
The Retriever selects relevant articles from the database based on reformulated queries, utilizing either the custom scoring function or Embeddings (vector representations) stored in a Vectorstore (stores embeddings) for a RAG variant.
Evaluation on Polish law apprenticeship exam questions demonstrates gAlus significantly enhances LLM performance in providing correct answers and citing relevant legal provisions.

1st July 2025

STELLA: Self-Evolving LLM Agent for Biomedical Research

STELLA: introduces a self-evolving LLM agent for biomedical research, leveraging Manager, Dev, Critic, and Tool Creation Agents, an evolving Template Library, and a dynamic Tool Ocean, along with Conda Environment, Scripts, Input, Final Result, and Human Expert/Wet Experiment feedback, to autonomously improve capabilities.
The agent employs a multi-agent architecture and two core self-evolving mechanisms: a Template Library for reasoning strategies and a dynamic Tool Ocean for accessible tools.
STELLA learns from experience, dynamically expanding its knowledge and skills to tackle complex biomedical challenges and improve performance over time.

WebArXiv: Evaluating Multimodal Agents on Time-Invariant arXiv Tasks

Web Agent with Dynamic Reflection: introduces WebArXiv, a static benchmark, and proposes a dynamic reflection mechanism for web agents, including Web Agent, Visual Observations, Element Texts, Interaction History, Dynamic Reflection Mechanism, Model, Reasoning Context, Action Execution, and History Update components.
WebArXiv provides a stable and reproducible environment for evaluating web agents on time-invariant arXiv tasks.
The dynamic reflection mechanism enhances agent performance by selectively retrieving relevant past interaction steps for improved decision-making.

Enhancing LLM Agent Safety via Causal Influence Prompting

CIP (Causal Influence Prompting): introduces a novel technique for enhancing LLM agent safety by leveraging Causal Influence Diagrams (CID) initialization, Environment interaction, and CID refinement.
The approach uses CIDs to represent cause-and-effect relationships in the agent's decision-making process, enabling reasoning about potential consequences.
Iterative refinement of the CID based on observed behaviors allows the agent to anticipate harmful outcomes and make safer decisions.

Large Language Model Powered Intelligent Urban Agents: Concepts, Capabilities, and Applications

Urban LLM Agents: introduces a framework for LLM-powered agents operating in urban environments, with LLMs (Core controller), Urban Sensing (Collects, interprets urban signals), Memory Management (Organizes, retrieves urban knowledge), Reasoning (Simulates, plans actions), Execution (Translates plans into actions), and Learning (Adapts, improves behavior) components.
These agents are semi-embodied, interacting with cyber-physical-social urban systems through APIs, databases, and platforms to support system-level decision-making.
The paper surveys the research landscape, categorizes applications, and discusses trustworthiness and evaluation challenges for real-world deployment.

TransLaw: Benchmarking Large Language Models in Multi-Agent Simulation of the Collaborative Translation

TransLaw: introduces a multi-agent framework for legal judgment translation, featuring Translator Agent, Annotator Agent, and Proofreader Agent powered by LLMs.
The framework simulates a professional translation workflow where agents collaborate, utilizing Proofreading Memory, Translation Memory, and a Terminology database.
A Memory module supports agent self-adaptation by storing interaction history, aiming to improve translation quality and efficiency.

Many LLMs Are More Utilitarian Than One

LLM-MAS (Large Language Model Multi-Agent Systems): introduces a study on collective moral reasoning in LLMs, featuring LLM Agent (Individual large language model) in Solo Condition (Independent reasoning) or Group Condition (Multi-agent deliberation) involving a Discussion Phase (Multi-turn agent exchange) and a Reflection Phase (Private reasoning and scoring).
The research investigates whether multi-agent LLM systems exhibit a utilitarian boost in moral judgments compared to individual LLMs.
Experiments with six different LLMs in pairs and triads show a consistent shift towards endorsing norm violations that maximize overall welfare.

Generative Exaggeration in LLM Social Agents: Consistency, Bias, and Toxicity

LLM Social Agents: introduces, "Generative Exaggeration in LLM Social Agents: Consistency, Bias, and Toxicity", with LLMs (Generate responses), LLM Agents (Simulate users), Zero Shot Initialization (Uses political leaning), Few Shot Initialization (Uses user history), User Profile Data (Bio, tweets for Few Shot), and Tweet Conversation Context (Input tweets for reply), where the paper investigates how LLMs simulate political discourse on social media using agents initialized with varying user data.
The study evaluates three LLM families (Gemini, Mistral, DeepSeek) under Zero Shot and Few Shot conditions, comparing their outputs to human replies on lexical diversity, ideological consistency, and toxicity.
Findings reveal "generative exaggeration," where LLMs amplify salient user traits, particularly in the Few Shot setting, leading to increased polarization, stylized language, and toxicity, challenging their reliability as social proxies.

ChatHLS: Towards Systematic Design Automation and Optimization for High-Level Synthesis

ChatHLS: introduces an automated end-to-end workflow for HLS design optimization and error correction, including C++ Input, LLM ① (HLS GEN), RAG, LLM ② (HLSTuner), HLS Tool (Testing), LLM ③ (Bug Fixing), LLM ④ (Instruction Adherence), LLM Group ⑤ (Multifaceted Assessment), LLM ⑥ (Scoring), BugRAG, QoR Pass Check, User Requirement, HLS-C Output, and HLS Dataset Collection.
The framework leverages fine-tuned LLMs within a multi-agent system for generating HLS-C code, optimizing designs, and systematically debugging errors.
ChatHLS utilizes a verification-oriented data augmentation paradigm (VODA) and iterative refinement to enhance LLM capabilities and achieve high code repair accuracy and performance speedups.

Does Math Reasoning Improve General LLM Capabilities? Understanding Transferability of LLM Reasoning

Large Language Model (LLM): investigates the transferability of reasoning capabilities in LLMs fine-tuned on math tasks by analyzing their internal latent space and output token distribution.
The research compares the impact of Supervised Fine-Tuning (SFT) and Reinforcement Learning (RL) fine-tuning paradigms on LLM generalization.
Findings indicate that RL-tuned models maintain more stable latent representations and token distributions, leading to better transferability across diverse tasks than SFT-tuned models.

iPanda: An Intelligent Protocol Testing and Debugging Agent for Conformance Testing

iPanda: introduces an end-to-end framework for automated protocol conformance testing, with Function Point Extractor (extracts points), Test Case Generation Module (generates test cases), LLM Interactor (generates test code), Execution Module (runs tests), Memory Module (manages memory), and Summarization Module (summarizes, reports).
The framework leverages LLMs, keyword-based test case generation, code-based retrieval-augmented generation, and iterative self-correction for test code refinement.
iPanda streamlines the testing process from specification analysis to result analysis, significantly reducing manual effort and improving efficiency.

STELLA: Self-Evolving LLM Agent for Biomedical Research

STELLA: introduces a self-evolving LLM agent for biomedical research, featuring a multi-agent architecture and two self-evolving mechanisms, designed to autonomously improve capabilities and accelerate discovery. The components are Manager Agent (coordinates agents, curates templates), Dev Agent (executes plan, generates code), Critic Agent (assesses results, provides feedback), Tool Creation Agent (creates/integrates new tools), Template Library (stores successful reasoning strategies), and Tool Ocean (dynamic tool/database collection).
The multi-agent architecture orchestrates complex tasks, while the self-evolving mechanisms allow the agent to learn from experience and expand its toolset dynamically.
STELLA demonstrates state-of-the-art performance on biomedical benchmarks and shows systematic improvement with increased computational experience.

Dynamic Strategy Adaptation in Multi-Agent Environments with Large Language Models

PPO+LLM framework: introduces a real-time reward shaping architecture for multi-agent strategy adaptation, integrating Environment (simulates multi-agent task), Grid State (raw environment state), Flattened Tensor (preprocessed observation), PPO Agent (reinforcement learning policy), Prompt Generation Module (creates text prompts), Frozen Large Language Model (evaluates prompts), and Reward Shaping Module (maps LLM feedback to reward).
The framework uses a frozen LLM to provide symbolic feedback on task context via prompts, which is converted into a reward shaping signal for the PPO agents.
This approach enables agents to dynamically adapt strategies in real-time based on high-level feedback, improving coordination and robustness in dynamic, noisy environments.

Enhancing LLM Agent Safety via Causal Influence Prompting

CIP (Causal Influence Prompting): introduces a novel technique leveraging Causal Influence Diagrams (CID) to enhance LLM agent safety by identifying and mitigating risks, including LLM Agent, Causal Influence Diagram (CID), CID Generation, Environment Interaction, CID Refinement, CID Constructor/Verifier Functions, Environment Observation, Task Instruction, and Action Space.
The approach involves initializing a CID from task specifications, guiding agent interactions using the CID, and iteratively refining the CID based on observed behaviors and outcomes.
Experimental results demonstrate that reasoning about cause-and-effect relationships based on CIDs improves the safety of LLM agents in various tasks, including code execution and mobile device control.

WebArXiv: Evaluating Multimodal Agents on Time-Invariant arXiv Tasks

Dynamic Reflection Mechanism: introduces, History Retrieval (Retrieve recent steps) / Reflection Process (Identify relevant history) / Context Construction (Combine history and current view) / Action Generation (Generate next action) / History Update (Add action and result), where the mechanism enhances web agent decision-making by selectively using past interaction steps.
The approach addresses the "Rigid History Reflection" failure mode by dynamically identifying the most relevant prior step for reasoning before generating the next action.
Evaluated on the WebArXiv benchmark, this mechanism improves the performance of LLM-driven agents on time-invariant web tasks.

TransLaw: Benchmarking Large Language Models in Multi-Agent Simulation of the Collaborative Translation

TransLaw: introduces a novel multi-agent framework for Hong Kong legal judgment translation, comprising Translator, Annotator, and Proofreader agents powered by LLMs.
The framework simulates a professional translation workflow through collaborative task decomposition and specialized roles.
It incorporates memory modules and utilizes a terminology database to enhance translation quality and efficiency.

Performance of LLMs on Stochastic Modeling Operations Research Problems: From Theory to Practice

LLM Evaluation on OR Problems: introduces an evaluation of LLMs on stochastic modeling problems, including LLMs, OR Problems Dataset, SimOpt Library, Evaluation Mechanism, and Simulation Environment, assessing their capabilities in the analysis and optimization stage of the OR pipeline.
The study tests LLMs on graduate-level homework, qualification exam problems, and simulation-optimization tasks from the SimOpt library.
Results indicate state-of-the-art LLMs perform comparably to human experts on theoretical problems and match in-house solvers on practical simulation-optimization tasks, highlighting their potential as OR research assistants.

Lessons Learned from Evaluation of LLM based Multi-agents in Safer Therapy Recommendation

MAS (Multi-agent System): introduces a dynamically generated multi-agent framework that simulates real-world multidisciplinary expert consultations, including Patient's Condition (Input data), General Practitioner (GP) Agent (Workflow coordinator), Specialist Agents (Domain experts), Discussion Group (Collaborative forum), and Mediator Agent (Consensus facilitator), to detect and resolve medical conflicts for safer therapy recommendations.
The framework replicates the multi-step workflow of Multidisciplinary Teams (MDTs), enabling LLMs to propose improved treatment plans by detecting and resolving conflicts.
This study also develops a new interpretable evaluation strategy, comparing LLM-proposed treatment plans with original plans focusing on conflict reduction and medication burden.

Black Box Deployed: Functional Criteria for Artificial Moral Agents in the LLM Era

SMA-LLS (Simulating Moral Agency through Large Language Systems): introduces a revised set of ten functional criteria to evaluate LLM-based Artificial Moral Agents, including Moral Concordance (aligns with human principles), Context Sensitivity (adapts to situational nuances), Normative Integrity (coherent ethical values), Metaethical Awareness (recognizes moral uncertainty), Systemic Resilience (robust against attacks/stress), Trustworthiness (warrants human reliance), Corrigibility (adaptable to feedback), Partial Transparency (provides decision insight), Functional Autonomy (independent ethical operation), and Moral Imagination (generates creative ethical responses), shifting the focus from opaque internal states to observable, functionally moral behavior.
The paper argues that traditional ethical criteria, which assume transparent architectures, are obsolete for LLMs due to their stochastic outputs and opaque internal states, necessitating a functionalist approach to AI ethics.
The proposed criteria are illustrated using hypothetical scenarios involving an Autonomous Public Bus (APB) to demonstrate their practical applicability in morally salient contexts, emphasizing behavioral reliability and alignment with human values for safe deployment.

Differentially Private Synthetic Data Release for Topics API Outputs

Differentially Private Synthetic Data Generation Methodology: introduces a novel approach for generating synthetic Topics API outputs that mimic real API traces while providing strong privacy guarantees.
This methodology involves extracting differentially private statistics from real user data, optimizing a parameterized model to match these statistics, and then sampling from the optimized model to create synthetic data.
The generated synthetic dataset enables external researchers to empirically study the privacy properties and re-identification risks of the Topics API, fostering transparency in Privacy-Preserving Ads APIs.

Beyond DNS: Unlocking the Internet of AI Agents via the NANDA Index and Verified AgentFacts

NANDA: introduces a lean, modular index architecture for the Internet of AI agents, comprising a Lean Index Layer (core identity resolution), an AgentFacts Layer (metadata distribution tier), and a Dynamic Resolution Layer (adaptive routing tier).
This architecture decouples static identity resolution from verifiable metadata distribution and dynamic endpoint routing, enabling scalable, secure, and privacy-preserving discovery and interaction for billions of AI agents.
The system aims to overcome DNS limitations for dynamic AI agent environments by providing rapid global resolution, sub-second revocation, schema-validated capability assertions, and privacy-preserving discovery.

Autonomous Resource Management in Microservice Systems via Reinforcement Learning

Reinforcement Learning-based Resource Management Model: introduces an intelligent reinforcement learning-based method for microservice resource scheduling and optimization, with Agent (decision-making entity), Environment (simulated microservice system), State (system status input), Action (resource allocation/scheduling output), Reward (performance feedback), Policy Network (action selection mechanism), G-Network (value/action generation), Experience Replay (memory for learning), Default Load (baseline workload), and Unlimited Repair (system resilience), where it dynamically adjusts resource allocation and data flow paths to enhance system performance.
The model leverages Deep Q Network (DQN) methods, experience replay, and neural networks (Policy Network, G-Network) to learn optimal strategies for resource allocation and data flow scheduling in dynamic microservice environments.
Experimental results demonstrate significant improvements in response time, throughput, resource utilization, and cost efficiency across various load and resource conditions compared to traditional static allocation methods.

Emergent Cognitive Convergence via Implementation: A Structured Loop Reflecting Four Theories of Mind – A Position Paper

Agentic Flow: introduces a structured cognitive loop with five modules—Retrieval (retrieval-augmented generation), Cognition (LLM-based reasoning), Control (monitoring/validation/arbitration), Memory (context/state tracking), and Action (tool execution/logging)—designed to overcome LLM limitations and align with four theories of mind.
This architecture demonstrates how practical implementation can reveal structural convergence across Kahneman's dual-system theory, Friston's predictive processing, Minsky's society of mind, and Clark's extended mind, suggesting shared architectural patterns driven by functional demands.
Empirical evaluation shows Agentic Flow outperforms baseline LLM-only agents in multi-step, conditional reasoning tasks, exhibiting enhanced task success, robust constraint adherence, and reduced hallucinations.

30th June 2025

L0: REINFORCEMENT LEARNING TO BECOME GENERAL AGENTS

L0 (L-Zero): introduces a scalable, end-to-end training pipeline for general-purpose agents, featuring the NB-Agent (Agent architecture scaffold) operating within a Python Environment (Interactive code execution environment) using Predefined Tools (Available agent action tools) and managed by a Context Watcher (Manages LLM context) and Notepad (State and memory management).
The L0 framework utilizes AgentRL (Reinforcement learning framework) for training, which includes a Training Engine (Manages RL updates), Inference Server (Hosts agent policy), Agent Workers (Execute agent rollouts) in a Brwap Sandbox (Isolated worker environment), a Single Controller (Dispatches tasks), Agentic Policy Gradient (Policy gradient for actions), Agentic Reward (Verifiable training reward), Dynamic Sampling (Exploration and stability strategy), and collects Trajectories (Collected interaction sequences).
L0 trains the NB-Agent, powered by a Large Language Model (Core action generation model), to perform multi-turn, long-horizon tasks by generating and executing code actions in a REPL-style loop, leveraging verifiable rewards for effective learning.

SPIRAL: Self-Play on Zero-Sum Games Incentivizes Reasoning via Multi-Agent Multi-Turn Reinforcement Learning

SPIRAL: introduces a self-play framework for LLMs using a Distributed Actor-Learner Architecture, Parallel Rollout, Centralized Learner, Role-conditioned Advantage Estimation, Shared Policy, Zero-Sum Games, Evaluation Games, Vectorized Environment, and Model Inference, enabling language models to develop reasoning through multi-turn competitive self-play on zero-sum games.
The framework utilizes a distributed actor-learner system with parallel rollout in vectorized game environments and a centralized learner processing trajectories using Role-conditioned Advantage Estimation to update a shared, role-conditioned LLM policy.
Self-play on zero-sum games generates an infinite curriculum, forcing the shared policy to continuously adapt and develop transferable reasoning skills without human supervision or domain-specific data.

Agent.xpu: Efficient Scheduling of Agentic LLM Workloads on Heterogeneous SoC

Agent.xpu: introduces an efficient serving system for agentic LLM workloads on memory-unified heterogeneous SoCs, with Offline Model Compilation and Warmup (Prepares LLM model), Online Workload-Aware Scheduling (Manages runtime execution), and Hetero-SoC Hardware Layer (Underlying hardware).
The system uses offline profiling to build a Heterogeneous Execution Graph (HEG) and annotate Elastic Kernels for online scheduling.
The online scheduler employs a Dual-Queue Architecture, Task Decomposition and Dispatch, XPU Coordinator, Fine-Grained Kernel-Level Preemption, Slack-Aware Kernel Backfill, and Contention Mitigation to manage reactive and proactive tasks on CPU, iGPU, and NPU with Shared Memory.

Auto-TA: Towards Scalable Automated Thematic Analysis (TA) via Multi-Agent Large Language Models with Reinforcement Learning

Auto-TA: introduces a fully automated LLM pipeline for thematic analysis, with Generation Agents (Initial processing) including Coder Agents with Identities (Generate initial codes) and Theme-Generation Agents (Cluster codes, generate themes), a Feedback Agent (Evaluate, refine themes), and optional Reinforcement Learning (optional) (Optimize themes via feedback) involving Human Raters (Provide feedback for RL) and an RL Trainer (Update policy).
The framework processes clinical narratives end-to-end, eliminating the need for manual coding or full transcript review.
Specialized LLM agents collaborate to enhance theme quality and alignment, with optional RLHF improving thematic relevance based on human feedback.

LLM Agents Are the Antidote to Walled Gardens

Universal Interoperability: introduces LLM Agents (Understand text/code, interact external tools/web), Agent-friendly interfaces (Provide metadata for agent interaction), Security by design (Mechanisms for agent permissions/safety), and Ecosystem infrastructure (Protocols, standards for agent interaction), proposing LLM agents enable seamless data exchange and workflow coordination between digital services via AI-mediated adapters.
This approach aims to reduce integration effort and cost by allowing agents to translate formats and interact with interfaces, overcoming traditional technical and strategic barriers.
Establishing foundational infrastructure for agent-friendly interfaces, security, and ecosystem protocols is crucial to mitigate risks and ensure robust, secure, and effective interoperability.

A Survey on Autonomy-Induced Security Risks in Large Model-Based Agents

R2A2 (Reflective Risk-Aware Agent Architecture): introduces a modular framework integrating safety, alignment, and risk-awareness into LLM agent cognitive loops.
The architecture includes components for perception, memory, reasoning, planning, reflection, risk simulation, and action filtering.
Grounded in Constrained Markov Decision Processes, R2A2 enables risk-aware planning and constraint-sensitive execution for autonomous agents.

Leveraging a Multi-Agent LLM-Based System to Educate Teachers in Hate Incidents Management

ARISE (Agent Resource for Incident Support and Education): introduces a multi-agent LLM-based system with Manager Agent, Student Agents, Advisory Agents, RAG Module, Conversational Interface, and Feedback Mechanism, designed to educate teachers in hate incident management through realistic simulations.
The system uses persona modelling and retrieval-augmented generation to provide diverse perspectives and contextual information for analyzing hate speech incidents.
Teachers interact with the system via a chat interface to describe incidents and receive structured analysis, potential escalation risks, and intervention strategies.

A Survey of LLM-based Automated Program Repair: Taxonomies, Design Paradigms, and Applications

LLM-based Automated Program Repair: introduces a taxonomy with Base LLMs (Core models), Fine-tuning (Adapt LLM weights), Prompting (Single query frozen LLM), Procedural (Scripted multi-step workflow), and Agentic (LLM controls workflow) paradigms, enhanced by Retrieval-Augmented Generation (External knowledge augmentation) and Analysis-Augmented Generation (Program analysis augmentation).
This survey categorizes 63 recent systems, clarifying design trade-offs and challenges across different approaches.
The paper outlines research directions to advance reliable and efficient LLM-based APR.

DABstep: Data Agent Benchmark for Multi-step Reasoning

AI Agent on DABstep: introduces a benchmark evaluating AI agents on multi-step data analysis tasks, comprising Agent (AI model solving task), Environment (Context, data, tools) with Environment/Datasets (Structured data files), Environment/Docs (Unstructured documentation), and Environment/Code Execution (Code execution tool), interacting via Question (Task input), Answer (Task output), State (Agent's internal state), and Code/Actions (Agent's generated steps).
The benchmark features over 450 real-world financial data analysis tasks requiring multi-step reasoning, code execution, and integration of structured and unstructured data.
Evaluation uses an objective factoid-based scoring method, revealing a significant performance gap for current agents on complex tasks.

Agent4S: The Transformation of Research Paradigms from the Perspective of Large Language Models

Agent4S: introduces a five-level classification for LLM-driven agents to automate scientific research, featuring an Agent for Science, Memory, Model Context Protocol (MCP), Tools/External Agents, Reasoning Frameworks, and A2A Protocol.
The framework outlines a roadmap from automating single tools (L1) and complex pipelines (L2) to intelligent single-flow research (L3) and lab-scale autonomy (L4), culminating in cross-disciplinary multi-agent collaboration (L5).
Agent4S positions agents as productivity tools transforming scientific discovery by addressing the inefficiency of existing research paradigms and integrating AI into the entire research workflow.

PokéAI: A Goal-Generating, Battle-Optimizing Multi-agent System for Pokémon Red

PokéAI: introduces a text-based multi-agent LLM framework, with Planning Agent (Generates tasks), Execution Agent (Carries out tasks), Critique Agent (Evaluates task outcome), Long-term Memory (Stores game state, context), Passive Battle Module (Handles in-game battles), and Active Tool Selection (Navigation, Conversation tools), designed to autonomously play Pokémon Red.
The system operates in a closed loop where the Planning Agent generates tasks, the Execution Agent performs them, and the Critique Agent verifies completion.
A key component, the Passive Battle Module within the Execution Agent, demonstrates performance comparable to an experienced human player in battle scenarios.

Evaluating the Simulation of Human Personality-Driven Susceptibility to Misinformation with LLMs

Personality-aligned LLM agents: introduces a method using Human Personality Profiles and Personality Assignment to create LLM Agents that perform a Headline Evaluation Task, generating LLM Accuracy Ratings, which are assessed using Evaluation Metrics.
The research evaluates whether LLM agents conditioned on Big-Five personality profiles can replicate human susceptibility patterns to misinformation.
The study finds partial replication of human trait-misinformation associations, highlighting both the potential and limitations of LLMs for behavioral simulation.

Evaluating Multi-Agent Defences Against Jailbreaking Attacks on Large Language Models

AutoDefense: introduces a multi-agent LLM defence framework, evaluated in 1-, 2-, and 3-agent configurations, including Coordinator (manages agents), Intention Analyzer (evaluates response intent), Prompt Analyzer (infers original query), and Judge (determines response safety) components, designed to protect LLMs from jailbreak attacks by analyzing responses.
The study evaluates the framework's effectiveness against various jailbreak attacks and compares performance across different agent configurations using metrics like Attack Success Rate, False Positive Rate, and False Negative Rate.
Results indicate that increasing agents can reduce false negatives but may increase false positives, suggesting no single optimal configuration and highlighting challenges in evaluating ethically ambiguous content.

Thought-Augmented Planning for LLM-Powered Interactive Recommender Agent

TAIRA (Thought-Augmented Planning for LLM-Powered Interactive Recommender Agent): introduces a novel thought-augmented interactive recommender agent system featuring a Manager Agent, Executor Agents, and Thought Pattern Distillation to handle complex user intents.
The Manager Agent orchestrates tasks and plans subtasks using Thought Patterns and Hierarchical Planning, while Executor Agents like Searcher, Item Retriever, Task Interpreter, and Interactor execute specific functions.
Thought Pattern Distillation extracts high-level planning guidance from agent and human experiences to enhance the system's reasoning and generalization capabilities.

Are AI-Generated Fixes Secure? Analyzing LLM and Agent Patches on SWE-bench

Methodology for LLM-Based Code Generation and Security Evaluation: introduces a process using SWE-bench Dataset (Provides issues/PRs) and GitHub (Source of data) to evaluate patches generated by a Standalone LLM (Generates patches) and Agentic LLM Frameworks (Generate patches), employing Static Analysis Tools (Detect vulnerabilities) and Majority Vote (Filters vulnerabilities) for vulnerability detection, followed by an Empirical Study (Analyzes results).
The study compares the security of LLM/agent-generated patches to developer-written patches on real-world issues from the SWE-bench dataset.
Findings indicate that LLM/agent-generated patches introduce significantly more vulnerabilities, often linked to code and issue characteristics like broader edits and missing contextual information.

L0: REINFORCEMENT LEARNING TO BECOME GENERAL AGENTS

L0 (L-Zero): introduces a scalable, end-to-end training pipeline for general-purpose agents, featuring the NB-Agent scaffold, LLM, Python Environment (Jupyter Kernel), Notepad, Context Watcher, Predefined Tools, AgentRL framework, Training Engine (FSDP), Inference Server (SGLang), Agent Workers, Single Controller, Agentic Policy Gradient, Agentic Reward, Dynamic Sampling, and Brwap Sandbox.
The NB-Agent operates in a "code-as-action" paradigm within an interactive Python environment, using an LLM to generate reasoning and code executed in a Jupyter kernel.
The AgentRL framework trains the NB-Agent using a verifiable reward model and a scalable infrastructure for parallel multi-turn rollouts, enabling robust problem-solving skills via reinforcement learning.

EPITOME: PIONEERING AN EXPERIMENTAL PLATFORM FOR AI-SOCIAL SCIENCE INTEGRATION

Epitome: introduces, "Epitome, an experimental platform for AI-social science integration, includes Foundation Model Layer (Anchors platform with LLMs), Complex Application Development Layer (Accelerates experimental insights translation), Human-AI Collaborative Experimental Environment Layer (Supports human-AI interactions), Experimental Randomization Experimental Intervention Layer (Operationalizes experimental rigor), Data Visualization Data Collection Layer (Provides unified data cockpit), Canvas-Based Interactive Experimental Design (User-friendly visual experiment design), Experiment Management System (Manages experiments), User Management System (Manages users), My Bot (Intelligent dialogue module), My Chatroom (Multi-agent interactive chatroom), Town Simulation (Multi-agent virtual environment), Workflow Auto-Planning Algorithm (Automates complex workflows), Low-Code Development Platform (Dify) (Enables low-code development), My Materials (Upload intervention materials), My Questionnaire (Supports data collection formats), and Data Acquisition Cockpit (Integrates data visualization/collection), where "Epitome provides a comprehensive platform for designing, implementing, and analyzing social science experiments involving LLMs."
The platform features a five-layer functional framework and several key modules to bridge methodological gaps in human-AI interaction research.
Its canvas-based interface and integrated tools enable researchers to easily design complex experimental scenarios and automate data collection and analysis.

29th June 2025

Do LLMs Dream of Discrete Algorithms?

AI Agent: introduces a neurosymbolic approach augmenting LLMs with logic-based reasoning and modular tools, structured by MVC, enabling decomposition and orchestration.
The AI Agent architecture includes an Agent Core for orchestration, Memory for information storage, a Planner guided by logic reasoning, and various Tools for specific tasks.
This hybrid approach enhances reliability and interpretability for multi-step reasoning tasks by combining probabilistic LLMs with formal logic systems.

ATGen: A Framework for Active Text Generation

ATGen: introduces a comprehensive framework bridging active learning with text generation tasks, enabling AL-empowered annotation using human or LLM-based agents.
The framework provides a unified platform for implementing and benchmarking AL strategies tailored to NLG tasks.
- It includes a web GUI, various AL strategies, support for LLM integration, efficient model tools, evaluation metrics, and a benchmarking platform.

Corrupted by Reasoning: Reasoning Language Models Become Free-Riders in Public Goods Games

Multi-Agent Public Goods Game Simulation: introduces, "a simulation framework", with Environment (Central coordinator), Institutions (Rule frameworks), and Agents (Autonomous decision-makers), where "the framework models LLM agents navigating a public goods dilemma with institutional choice and norm enforcement".
The simulation includes two types of Institutions, Sanctioning and Sanction-Free, allowing agents to choose environments with or without costly norm enforcement mechanisms.
Agents make decisions on institution choice, contribution, and sanctioning based on their history and anonymized group data, with their reasoning captured for analysis.

From Prompt Injections to Protocol Exploits: Threats in LLM-Powered AI Agents Workflows

LLM-Powered AI Agent Communications: surveys threats in these systems, including Agent A (MCP Host), Agent B (MCP Host), MCP Server, A2A Server, Local Data Source, Remote Service API, Agent Framework (Large Language Model), A2A Client, A2A protocol, Web Browser - User, and Public Knowledge Source components.
The paper introduces a unified, end-to-end threat model categorizing over thirty attack techniques across input manipulation, model compromise, system/privacy, and protocol vulnerabilities.
This work provides a comprehensive reference for designing robust defenses and establishing best practices for resilient LLM-agent workflows.

AURA: Agent for Understanding, Reasoning, and Automated Tool Use in Voice-Driven Tasks

AURA: introduces the first open-source, speech-to-speech task-oriented agent combining reasoning and tool use, featuring UI (user interface), ASR Module (speech recognition), TTS Module (text-to-speech), Dialog Processing Unit (processes dialogue) with Controller (central orchestrator), Agent (interleaves reasoning action), Actions (executable operations), Observation (environmental feedback), Dialog State Tracking (tracks dialogue state), and State (system memory) including Action-Observation History (action observation sequence), Conversation History (filtered chat history), and Dialog State (structured dialogue info), an LLM Server (hosts language model) with LLM (language model), Inference Engine (memory efficient inference), and ReAct Response Format (structured LLM output), and External APIs (Tools) (real-world services).
The system employs a cascaded architecture and integrates a ReAct-style agent to manage multi-turn dialogue and dynamic tool invocation for complex, goal-driven tasks.
AURA supports tools like calendar booking, contact lookup, web search, and email, demonstrating strong performance on VoiceBench and human evaluations for real-world task execution.

Voyager Vision: Investigating the Role of Multi-modal Information for Open-ended Learning Systems

Voyager Vision: introduces a multimodal open-ended learning system for Minecraft, with a curriculum agent (proposes next task), action agent (generates action code), critic agent (verifies task success), skill library (stores successful solutions), environment (Minecraft game world), visual inputs (agent's POV screenshot), textual inputs (environment data/status), and LLM (underlying multimodal model).
The system extends the original Voyager framework by incorporating visual inputs (screenshots) alongside textual inputs to enable building tasks in addition to resource gathering.
The agents iteratively interact with the Minecraft environment, using multimodal inputs and self-verification to learn and perform tasks, storing successful code in the skill library.

Benchmarking Deep Search over Heterogeneous Enterprise Data

HERB (Heterogeneous Enterprise RAG Benchmark): introduces, a new benchmark for evaluating RAG systems on deep search over heterogeneous enterprise data, with Data Sources, Product Lifecycle Workflows, Enterprise Query Types, Workflow-guided Synthesis, LLM Simulation, Artifacts, Queries, Answerable QA Pairs, and Unanswerable Queries.
The benchmark features a synthetic data pipeline simulating enterprise workflows to generate interconnected artifacts and realistic multi-hop questions with guaranteed ground-truth answers.
It includes noise and unanswerable queries to stress-test RAG systems' ability to handle complex, dispersed knowledge and identify missing information.

Integrating Large Language Models in Financial Investments and Market Analysis: A Survey

MarketSenseAI: introduces an AI-driven framework utilizing GPT-4 for stock selection and portfolio management, with news summarizer, financial fundamentals analyzer, stock price dynamics module, macroeconomic environment summarizer, and decision-making layer components.
This framework integrates diverse data sources and applies CoT and ICL techniques for investment decision-making.
The decision-making layer uses GPT-4 as an expert analyst to produce actionable investment signals with explanations.

28th June 2025

Agent-to-Agent Theory of Mind: Testing Interlocutor Awareness among Large Language Models

Interlocutor Awareness Evaluation Setup: introduces a systematic evaluation of LLMs' ability to identify and adapt to conversational partners, utilizing LLMs acting as Identifier, Target, Sender, Solver, Player, Judge, Jailbreaker, and an Interpreter model.
The evaluation assesses LLM interlocutor awareness across reasoning patterns, linguistic style, and alignment preferences.
Case studies demonstrate the impact of this awareness on multi-agent cooperation, alignment, and safety.

Jan-nano Technical Report

Jan-nano: introduces Jan-nano, with Jan-nano (Language model), Multi-stage RLVR system (Training methodology), MCP (Tool integration protocol), Local RAG Server (Simulated search/scrape), Tools (Web search/scrape functions), which is a 4B parameter language model specialized for tool use and information retrieval, trained using a novel multi-stage RLVR system.
The training utilizes a local RAG server to simulate search and scrape tools, completely eliminating reliance on next token prediction training.
Evaluation and deployment leverage the Model Context Protocol (MCP) for flexible tool integration and agentic capabilities, demonstrating strong performance on tool usage tasks.

DICE-BENCH: Evaluating the Tool-Use Capabilities of Large Language Models in Multi-Round, Multi-Party Dialogues

DICE-BENCH: introduces a framework to evaluate LLM function-calling in multi-round, multi-party dialogues, utilizing Tool Collections, Tool Graph Construction, Scenario Configuration, Dialogue Generation with Parameter Generation and Dialogue Simulation via a Multi-Agent System (Agents, Orchestrator), processed through a Validation Pipeline (Automatic Evaluation, Rule-Based Filtering, Criteria-Based Filtering), and quantified by the DICE-SCORE metric.
The framework generates realistic function-calling datasets by synthesizing conversations based on tool dependencies and distinct agent personas.
DICE-SCORE measures the dispersion of tool-related information across dialogue turns, correlating with task difficulty for LLMs.

Knowledge Augmented Finetuning Matters in both RAG and Agent Based Dialog Systems

RAG and Agent Based Dialog Systems: introduces finetuning LLMs with domain-specific data and external knowledge (KAFT) within RAG and agent architectures, including Retriever (retrieves knowledge), Generator (LLM) (generates response), Decision Maker (decides search), and API Calling (calls search APIs).
The RAG system architecture comprises a Retriever and a Generator (LLM), while the agent system architecture includes a Decision Maker, API Calling, and a Generator (LLM).
KAFT is applied to the Generator (LLM) in both RAG and agent systems and the Decision Maker in the agent system to improve the utilization of external knowledge.

Memory as a Service (MaaS): Rethinking Contextual Memory as Service-Oriented Modules for Collaborative Agents

MaaS (Memory as a Service): introduces a service-oriented perspective for contextual memory in LLM-based agent systems, proposing a dual architecture with Memory Containers, a Memory Routing Layer, and a Fine-grained permission control mechanism to enable governable cross-entity memory sharing.
The framework decouples contextual memory from its local state, encapsulating it as independently callable, dynamically composable, and finely governed service modules.
MaaS aims to dismantle memory silos and support complex, long-term collaboration across diverse entities while rigorously respecting the private nature of memory assets.

FairMarket-RL: LLM-Guided Fairness Shaping for Multi-Agent Reinforcement Learning in Peer-to-Peer Markets

FairMarket-RL: introduces a multi-agent reinforcement learning framework for peer-to-peer markets, incorporating a Large Language Model (LLM) as a real-time fairness critic to guide agent rewards.
The framework utilizes Independent Proximal Policy Optimization (IPPO) for agent training, blending raw economic rewards with LLM-generated fairness scores (Fairness-To-Buyer, Fairness-Between-Sellers) via a scheduled shaping mechanism.
FairMarket-RL demonstrates improved fairness and efficiency in simulated P2P energy trading, achieving high demand fulfillment and balanced profits by replacing static rules with dynamic LLM feedback.

27th June 2025

Knowledge-Guided Multi-Agent Framework for Automated Requirements Development: A Vision

KGMAF: introduces a knowledge-guided multi-agent framework for automated requirements development, composed of Agents (LLM-based entities) and an Artifacts Pool (Central artifact repository).
The agents collaborate by monitoring and interacting with the artifacts pool, which stores intermediate and final requirements artifacts.
Each agent is equipped with specific functionality, predefined actions, planning mechanism, and injected knowledge to perform requirements tasks.

URSA: The Universal Research and Scientific Agent

URSA (The Universal Research and Scientific Agent): introduces a scientific agent ecosystem for accelerating research tasks, consisting of a Planning Agent (Breaks down problems), Execution Agent (Carries out tasks), Research Agent (Gathers online info), Hypothesizer Agent (Generates hypotheses), ArXiv Agent (Summarizes research papers), LLMs (Backend models), LangGraph (Agent framework), DuckDuckGo Search Tool (Performs web search), Web Scraping/Parsing Tool (Extracts web content), Command Line Tool (Executes system commands), Write Code Tool (Writes code files), Run Physics Code Tool (Executes physics simulations), ArXiv Search Tool (Searches ArXiv API), and Vision Model (Processes images).
The framework utilizes a set of modular, composable agents coupled with tool use to hypothesize, plan, and execute research tasks, building on large language model capabilities.
URSA demonstrates the potential for agentic AI to address scientific problems of varied complexity, including leveraging advanced physics simulation codes for design automation.

REXBENCH: Can coding agents autonomously implement AI research extensions?

REXBENCH: introduces a benchmark for evaluating LLM agents' ability to implement research extensions, with Input (Research paper, Codebase, Task instruction), Agent Execution (LLM Agent, Patch file generation), Agent Evaluation Infra (Virtual Machine, Task Execution, Evaluation Metrics), and Evaluation (Results from Experiment, Final Success Rate calculation) components, where the paper evaluates LLM agents on realistic research extension tasks using an automatic evaluation infrastructure.
The benchmark consists of 12 tasks based on existing research papers and codebases, requiring agents to implement novel extensions and produce code changes.
An automatic evaluation infrastructure executes the agent-generated code in controlled virtual machines and assesses performance using metrics like File Recall, Execution Success Rate, and Final Success Rate, revealing that current agents struggle with these tasks.

Seamless Interaction: Dyadic Audiovisual Motion Modeling and Large-Scale Dataset

Dyadic Motion Models: introduces a framework for generating dyadic audiovisual motion, including Speech Tokenizer (processes audio), Face & Body Feature Extractor (processes user visual), Dyadic Motion Model (generates motion features), Speech Model (provides LLM features), Valence Adapter (maps to valence codes), Arousal Adapter (maps to arousal codes), Gesture Adapter (maps to gesture codes), Face Adapter (maps generic to personalized face features), Body Adapter (maps body features to avatar rig), 3D Full-Body Codec Avatar Decoder (renders 3D avatar), Gaussian Splatting (3D rendering technique), and Linear Blend Skinning (deforms avatar mesh).
The framework utilizes dyadic audio and optional user visual input to generate intermediate face and body motion features.
LLM integration via adapters enables controllable emotion and gesture generation, while adapters and a 3D decoder facilitate photorealistic avatar rendering for interactive agents.

The Automated LLM Speedrunning Benchmark: Reproducing NanoGPT Improvements

AI Research Agent: introduces, "The Automated LLM Speedrunning Benchmark", with LLM (Large Language Model), Search Scaffold (Iteratively uses LLM), Coder (Generates/modifies code), Executor (Runs code), Analyzer (Summarizes execution results), Knowledge (External information source), and History (Record of attempts), evaluating the ability of AI agents to reproduce NanoGPT speedrun improvements.
The benchmark tasks agents with reproducing successive speedrun records, providing the previous record's script and optional hints in various formats.
The AI research agent, composed of an LLM and a search scaffold, attempts to reproduce the record, and its performance is measured by the fraction of speedup recovered and code similarity.

Exploring Modularity of Agentic Systems for Drug Discovery

smolagent framework: evaluates the modularity of LLM-based agentic systems for drug discovery, with Code Agent (Writes and executes code), ToolCalling Agent (Uses external tools), LLM (Backbone language model), System prompt (Agent instructions), Tools (Cheminformatics functions), LLM Judge (Evaluates agent answers), where the system uses different agents, LLMs, and prompts to answer cheminformatics questions, evaluated by an LLM-as-a-judge.
The study compares the performance of CodeAgent and ToolCallingAgent types, seven different LLMs, and three system prompts on a set of 26 cheminformatics questions.
Performance is assessed using an LLM-as-a-judge system that scores agent answers based on expected answers, highlighting the dependence of performance on LLM, agent type, and prompt.

Don't Trust Generative Agents to Mimic Communication on Social Networks Unless You Benchmarked their Empirical Realism

TWONs (Twins of Online Social Networks): introduces a formal framework for simulating social networks with Agents (Social media users) having Agent State (Agent's discourse history) and Communicative Behavior (Agent generates messages), interacting via Network Mechanics (Adapts incoming messages), focusing on Imitating User Behavior (Estimate agent behavior function) including Imitating Posting Behavior (Model content generation), Imitating Replying Behavior (Model reply generation), and Estimating Replying Likelihood (Predict reply probability) using LLMs (Basis for agents) with Fine-Tuning (Adapting LLMs) and a BERT-based Encoder (Embeds text for likelihood).
The paper empirically tests LLM-based imitation of user behavior on X (formerly Twitter) in English and German, benchmarking empirical realism.
Findings suggest fine-tuning and language-specific considerations are crucial for achieving realistic social simulations with generative agents.

More Vulnerable than You Think: On the Stability of Tool-Integrated LLM Agents

Tool-Integrated LLM Agent: introduces an evaluation of the stability of tool-integrated LLM agents, focusing on vulnerabilities during the tool invocation process related to tool documentation, tool usage hallucination, and tool response attacks.
The study investigates how internal and external factors impact agent performance and stability when interacting with external tools using the ReAct framework.
Experiments reveal that agents are highly susceptible to errors at each stage of tool invocation, with open-source models generally more vulnerable than proprietary ones.

CAL-RAG: Retrieval-Augmented Multi-Agent Generation for Content-Aware Layout Design

CAL-RAG (Retrieval-Augmented Multi-Agent Generation): introduces a framework for content-aware layout generation using a Layout Recommender Agent (Suggests initial layout), Layout Generation Tool (Creates visual representation), Grader Agent (Evaluates layout quality), and Feedback Agent (Provides refinement feedback).
The framework operates iteratively, retrieving relevant layout examples, proposing structured element placements, evaluating the generated layout based on visual metrics, and providing targeted refinements.
This multi-agent system combines retrieval augmentation with agentic reasoning to achieve scalable, interpretable, and high-fidelity automated layout generation.

ARAG: Agentic Retrieval Augmented Generation for Personalized Recommendation

ARAG (Agentic Retrieval-Augmented Generation): introduces a multi-agent framework for personalized recommendation, integrating a User Understanding Agent (Summarizes user preferences), Natural Language Inference Agent (Evaluates semantic alignment), Context Summary Agent (Summarizes NLI-filtered evidence), and Item Ranker Agent (Generates ranked list) to refine context retrieval and item ranking.
The framework leverages specialized LLM-based agents that collaborate in a blackboard-style system to process user context and candidate items.
This agentic approach enhances context awareness, semantic grounding, and personalization in recommendation systems by decomposing the task into distinct reasoning steps.

SPAZER: Spatial-Semantic Progressive Reasoning Agent for Zero-shot 3D Visual Grounding

SPAZER: introduces a VLM-driven agent for zero-shot 3D visual grounding, integrating 3D spatial localization and 2D semantic verification through a progressive reasoning process with 3D Holistic View Selection (Generates, selects optimal 3D view), Candidate Object Screening (Filters, ranks potential objects), 3D-2D Joint Decision-Making (Integrates 3D/2D for final grounding), and VLM (Core reasoning, decision-making engine) components.
The approach leverages holistic 3D rendered views for global spatial context and incorporates retrieval-augmented candidate screening for enhanced robustness.
SPAZER performs 3D-2D joint decision-making by combining information from selected 3D views and relevant 2D camera images to identify the target object.

A LARGE LANGUAGE MODEL-EMPOWERED AGENT FOR RELIABLE AND ROBUST STRUCTURAL ANALYSIS

LLM-empowered Agent: introduces a framework for reliable and robust structural analysis by reframing the task as code generation, utilizing an LLM (Generates code) guided by a Prompt Engineering Layer (Constructs prompt) with structured Prompt Design (Structures prompt) components including Role specification (Assigns persona), Chain of thought (Guides reasoning), A complete example (Provides full example), Function usage examples (Shows code usage), and Prescriptive Constraints (Enforces rules), and integrating external tools like a Code Execution Tool (Runs code) and a Visualization Tool (Visualizes results).
The Prompt Engineering Layer structures the input to the LLM using a detailed template to improve code generation accuracy and domain alignment for structural analysis problems.
The agent automatically executes the generated OpenSeesPy code and visualizes results using OpsVis, providing a reliable and interpretable workflow for automating structural analysis.

26th June 2025

CitySim: Modeling Urban Behaviors and City Dynamics with Large-Scale LLM-Driven Agent Simulation

CitySim: introduces a large-scale urban simulation framework using LLM-powered agents with Persona Module (Demographics, traits, habits), Memory Module (Temporal, reflective, spatial), Belief Module (Updates POI beliefs), Needs Module (Tracks, prioritizes needs), Long-Term Goal Module (Forms, revises aspirations), Perception Module (Observes environment, reacts), Planning Module (Generates daily schedules), Place Selection Module (Determines activity location), Vehicle Selection Module (Selects transport mode), and Social Module (Manages social interactions) to model human-like behavior.
The framework enables agents to generate realistic daily schedules and long-term plans through recursive, value-driven planning, balancing mandatory activities, personal habits, and situational factors.
CitySim agents are equipped with spatial and temporal memories to recall experiences, form beliefs about places, and adapt future decisions, demonstrating closer alignment with real human behavior than prior work.

MobiVerse: Scaling Urban Mobility Simulation with Hybrid Lightweight Domain-Specific Generator and Large Language Models

MobiVerse: introduces a hybrid framework for urban mobility simulation, combining a Domain-Specific Generator for base activity chains with an LLM Activity Chain Modifier for context-aware adaptation, integrated within a Visualized Simulation Environment using SUMO.
The framework utilizes a SUMO Controller for simulation execution and data collection, a Trajectory Viewer for visualization, and global data stores (Road Network, POI Info, Agent Info) for system-wide access.
Supporting components like the PromptManager, RoadClosureHandler, and EventHandler manage LLM interactions and specific environmental events to enhance behavioral realism and scalability.

SEEA-R1: Tree-Structured Reinforcement Fine-Tuning for Self-Evolving Embodied Agents

SEEA-R1 (Self-Evolving Embodied Agents-R1): introduces a reinforcement fine-tuning framework for embodied agents, featuring Policy Model (Predicts actions), Reward Model (MGRM) (Predicts task outcomes), Data Evolution (MCTS) (Generates experience), Model Evolution (Tree-GRPO) (Updates models), Environment (Provides observations/rewards), and Experience Dataset (Stores interaction data).
The framework drives continuous improvement through iterative Data Evolution and Model Evolution cycles, using MCTS for experience generation and Tree-GRPO for policy updates.
It utilizes a Multi-modal Generative Reward Model (MGRM) to provide dense, generalizable reward signals, reducing dependence on sparse environment rewards.

Beyond Reactive Safety: Risk-Aware LLM Alignment via Long-Horizon Simulation

Proactive Alignment Framework: introduces a method that simulates long-term societal consequences of LLM advice using World Modeling and an Event Scripting Model to generate Feedback, which is then used by an Improver to refine responses.
The framework explores causal event trajectories via Event Trajectory Search, identifies affected population Strata, and generates Agent Feedback for these groups.
This approach enhances LLM risk awareness, leading to Refined Responses that are safer, achievable through inference-time refinement or offline Realignment Training.

ParEval-RepO: A Benchmark Suite for Evaluating LLMs with Repository-level HPC Translation Tasks

LLM-based translation techniques: introduces PAREVAL-REPO, a benchmark suite for evaluating repository-level HPC translation using Non-agentic method (file-by-file translation), Top-down agentic method (multi-agent system), and SWE-agent (autonomous coding agent).
The Top-down agentic method comprises a Dependency agent (determines file dependencies), Context agent (manages translation changes), Chunk agent (splits large files), and Translation agent (translates code chunks).
Evaluation metrics (assess translation quality) and Error analysis (identifies translation errors) are used to assess various LLMs and techniques, highlighting challenges in build system generation and cross-file dependencies.

LLM-guided Chemical Process Optimization with a Multi-Agent Approach

Multi-Agent Framework: introduces LLM agents (ContextAgent, ParameterAgent, ValidationAgent, SimulationAgent, SuggestionAgent) collaborating within a GroupChat environment, leveraging IDAES simulation, for chemical process optimization.
The framework operates in two phases: autonomous constraint generation by the ContextAgent followed by iterative optimization guided by the other agents.
This approach addresses the constraint definition bottleneck in traditional optimization by autonomously inferring operating bounds from minimal descriptions.

LLM-guided Chemical Process Optimization with a Multi-Agent Approach

Multi-Agent Framework: introduces LLM agents (ContextAgent, ParameterAgent, ValidationAgent, SimulationAgent, SuggestionAgent) collaborating within a GroupChat environment, leveraging IDAES simulation, for chemical process optimization.
The framework operates in two phases: autonomous constraint generation by the ContextAgent followed by iterative optimization guided by the other agents.
This approach addresses the constraint definition bottleneck in traditional optimization by autonomously inferring operating bounds from minimal descriptions.

FaSTA*: Fast-Slow Toolpath Agent with Subroutine Mining for Efficient Multi-turn Image Editing

FaSTA* (Fast-Slow Toolpath Agent): introduces a neurosymbolic agent for multi-turn image editing, including LLM (high-level planning / reasoning), VLM (quality checking), Subroutine Rule Table (learned subroutines / rules), and Online Subroutine Learning Mechanism (learns / refines subroutines).
It utilizes an Adaptive Fast-Slow Execution Strategy (fast / slow planning) that attempts learned subroutines first and triggers A* Search (low-level toolpath search) as a fallback, supported by Knowledge Structures (TDG / MDT / BT) and AI Tools (image editing operations).
This method achieves substantial cost savings in execution time while maintaining image editing quality comparable to state-of-the-art baselines.

Theory of Mind in Action: The Instruction Inference Task

Tomcat (LLM-based agent): introduces Tomcat, with LLM, Common Ground, Demonstration Exemplars, Instruction Interpretation and Intention Inference, Response Generation, and Outputs components, designed to interpret indirect instructions and infer principal intentions in a collaborative task.
The framework leverages in-context learning via Demonstration Exemplars (Few-shot CoT or Commonsense Prompt) and Common Ground to guide the LLM's reasoning process.
Tomcat generates structured outputs including action plans, natural language responses, and instruction type classifications to assist a human principal in a gridworld environment.

Hierarchical Reasoning Model

HRM (Hierarchical Reasoning Model): introduces a novel recurrent architecture with an Input Network (converts tokens to vectors), Low-level (L) Module (rapid, detailed computations), High-level (H) Module (slow, abstract planning), Output Network (transforms states to probabilities), Adaptive Computational Time (ACT) (dynamic halting strategy), Q-head (predicts halt/continue actions), One-step Gradient Approximation (efficient gradient computation), Deep Supervision Mechanism (multi-segment feedback), Transformer Blocks (core recurrent module architecture), Rotary Positional Encoding (enhances Transformer blocks), Gated Linear Units (enhances Transformer blocks), RMSNorm (layer normalization variant), and Post-Norm Architecture (improves stability), designed to achieve significant computational depth and efficiency for complex reasoning tasks.
Inspired by hierarchical and multi-timescale brain processing, HRM employs two interdependent recurrent modules that collaborate to solve tasks in a single forward pass without explicit intermediate process supervision.
With only 27 million parameters and 1000 training samples, HRM achieves high performance on challenging reasoning tasks like Sudoku and maze navigation, outperforming larger LLMs and Chain-of-Thought methods.

25th June 2025

GPU Kernel Scientist: An LLM-Driven Framework for Iterative Kernel Optimization

GPU Kernel Scientist: introduces an automated, iterative framework for GPU kernel optimization, including Population (Stores kernels and performance data), LLM Evolutionary Selector (Selects kernels for iteration), LLM Experiment Designer (Designs optimization experiments), LLM Kernel Writer (Generates modified kernel code), and Benchmarking Platform (Evaluates kernel performance).
The framework leverages large language models across three core stages to iteratively refine GPU kernels based on performance feedback from an external evaluation system.
This LLM-driven approach aims to bridge knowledge gaps and accelerate kernel optimization, particularly in environments with limited documentation or tooling.

Poster: Enhancing GNN Robustness for Network Intrusion Detection via Agent-based Analysis

LLM Mitigation Pipeline: introduces an approach integrating LLM analysis into a GNN-based NIDS pipeline, including initial data processing, parameter configuration, data preprocessing, graph summarization, LLM agent analysis, and output generation for the GNN.
The pipeline employs LLM agents as simulated cybersecurity experts to analyze network graph elements and identify suspicious components before GNN processing.
This LLM-based mitigation strategy aims to enhance GNN resilience against realistic attacks like node injection by filtering or flagging malicious graph elements.

A SURVEY OF AI FOR MATERIALS SCIENCE: FOUNDATION MODELS, LLM AGENTS, DATASETS, AND TOOLS

AI4MS: introduces, "Common & Prevalent Tasks (broad application areas), Foundation Models (large pretrained models), Datasets (data collections), Tools & Infrastructures (supporting software platforms), and Successes, Limitations & Challenges, Future Directions (discussion points), where the paper surveys the landscape of AI for materials science."
The survey categorizes Foundation Models into Unimodal, Multimodal, and LLM Agents, and Datasets into Computational/Experimental and LLM Development.
It also reviews Tools & Infrastructures for Data Analysis/Management and Model Development, and discusses the current state and future directions.

The Decrypto Benchmark for Multi-Agent Reasoning and Theory of Mind

DECRYPTO: introduces a multi-agent benchmark for evaluating language models, featuring Alice (Chooses hints), Bob (Guesses code), and Eve (Intercepts code) interacting over shared Keywords (Secret words), Code (Secret digits sequence), and Hints (Words for code), tracked via Hint History (Past hints) and Code History (Past codes), and evaluated using Generalist Agents (Out-of-the-box LLMs), Specialist Agents (Task-specific agents), and specific Theory of Mind Tasks (Cognitive experiments).
The benchmark is based on a language game requiring players to reason about others' knowledge and beliefs to succeed in cooperative and competitive settings.
DECRYPTO provides a platform for studying multi-agent reasoning, theory of mind, and human-AI interaction in interactive, language-based scenarios.

Memento: Note-Taking for Your Future Self

Memento: introduces a three-stage strategy, with Plan generation (Decomposes question into steps), Prolog query (Symbolic representation of steps), Definitions (Natural language predicate mapping), Database construction (Populates fact database), Prolog database (Stores extracted/verified facts), Query execution (Evaluates query for answer), and LLM (Performs tasks in stages), which decomposes complex tasks, records outcomes, and uses Prolog for structured reasoning.
The method operates in three phases: plan generation, database construction, and query execution, leveraging LLMs to generate symbolic plans and populate a Prolog database.
Memento uses Prolog queries and a dynamically constructed database of facts to answer multi-hop questions, combining symbolic structure with LLM flexibility.

Memento: Note-Taking for Your Future Self

Memento: introduces a three-stage strategy, with Plan generation (Decomposes question into steps), Prolog query (Symbolic representation of steps), Definitions (Natural language predicate mapping), Database construction (Populates fact database), Prolog database (Stores extracted/verified facts), Query execution (Evaluates query for answer), and LLM (Performs tasks in stages), which decomposes complex tasks, records outcomes, and uses Prolog for structured reasoning.
The method operates in three phases: plan generation, database construction, and query execution, leveraging LLMs to generate symbolic plans and populate a Prolog database.
Memento uses Prolog queries and a dynamically constructed database of facts to answer multi-hop questions, combining symbolic structure with LLM flexibility.

Model Editing as a Double-Edged Sword: Steering Agent Ethical Behavior Toward Beneficence or Harm

Behavior Editing: introduces steering LLM-based agents' ethical behavior by editing the agent (Pre-edit Agent) to become a Post-edit Agent, enabling both benevolent and malicious steering.
The approach frames agent behavior steering as a model editing task, allowing precise and efficient modifications to influence behavior and moral alignment.
The BEHAVIORBENCH benchmark is developed to systematically evaluate this editing approach across diverse ethical scenarios and complexity levels.

Fine-Tuning and Prompt Engineering of LLMs, for the Creation of Multi-Agent AI for Addressing Sustainable Protein Production Challenges

Multi-Agent AI System: introduces a Retrieval-Augmented Generation (RAG)-oriented system for sustainable protein production research, with a Literature Search Agent (retrieves literature), Information Extraction Agent (extracts information), Pool of Scientific Literature (literature source), User Interface (user interaction), Toxicity Analysis Module (screens for toxicity), GPT-Based LLM (agent foundation), Prompt Engineering (optimisation method), Fine-Tuning (optimisation method), and External Sentence Transformer (evaluation tool).
The study compares fine-tuning and prompt engineering as methods to optimise the performance of the information extraction agent using GPT models.
This multi-agent system aims to automate the process of retrieving and extracting key information from scientific literature to accelerate research in sustainable protein production.

An Agentic System for Rare Disease Diagnosis with Traceable Reasoning

DeepRare: introduces an LLM-powered agentic system for rare disease diagnosis, with Central Host (Coordinates workflow, synthesizes info), Memory Bank (Stores diagnostic information, context), Agent Servers (Execute specialized tasks), Phenotype Extractor (Converts free-text to HPO), Phenotype Analyzer (Analyzes HPO, suggests diseases), Knowledge Searcher (Retrieves medical documents, web), Case Searcher (Finds similar patient cases), Genotype Analyzer (Annotates, ranks genetic variants), Disease Normalizer (Standardizes disease names), External Data Sources (Provide diagnostic evidence), Medical Literature (Peer-reviewed publications), Rare Disease Knowledge (Curated rare disease info), General Knowledge (Broad clinical resources), Case Collection (Repository of patient cases), and Gene Variant Databases (Genetic variant information) components, designed to process heterogeneous clinical inputs and generate traceable diagnostic reasoning.
The system employs a three-tier architecture comprising a central host, specialized agent servers, and diverse external data sources to facilitate complex diagnostic reasoning.
DeepRare generates ranked diagnostic hypotheses with transparent reasoning chains linked to verifiable medical evidence, enhancing interpretability and supporting clinical adoption.

SV-LLM: An Agentic Approach for SoC Security Verification using Large Language Models

SV-LLM (multi-agent assistant system): introduces a multi-agent framework for SoC security verification, with Application, Supervisor, Orchestrator, Agent, Data, and Infrastructure layers, designed to automate and enhance the verification workflow.
The system employs specialized LLM-driven agents for tasks including security Q&A, asset identification, threat modeling, test plan generation, vulnerability detection, and bug validation.
The layered architecture and agentic design aim to streamline complex verification tasks, reduce manual effort, and improve accuracy and scalability in hardware security analysis.

TAPS: Tool-Augmented Personalisation via Structured Tagging

TAPS (Tool-Augmented Personalisation via Structured Tagging): introduces a tuning-free approach for personalised tool use in task-oriented dialogue, combining an LLM (Generates response, predicts API calls), a Structured Tagging Tool (Augments data, adds tags), and an Uncertainty-based Tool Detector (Determines tool use, assesses confidence).
The framework leverages structured tagging to create an intermediate representation between natural language and API calls, enhancing argument extraction.
An uncertainty-based tool detector determines when to apply the structured tagging tool to improve performance.

Language Modeling by Language Models

Genesys: introduces an autonomous system for discovering novel language model architectures, with LMADE (Environment), Knowledge Engine (Knowledge access), Reference Library (Curated papers), External Sources (Search tools), Paper Vector DB (Vector database), Verification Engine (Verification tools), Symbolic Checker (Code analysis), Automated Trainer (Model training), Automated Evaluator (Model evaluation), Auto-Tuner (Parameter tuning), Runtime Checker (Training monitor), Evolutionary Tree (Design storage), LLM-driven Agents (Discovery agents), Designer Agents (Design creation), Proposer Agent (Proposal generation), Reviewer Agent (Proposal review), Planner Agent (Implementation planning), Coder Agent (Code writing), Observer Agent (Code review), Verifier Agents (Verification management), Generalized Autoregressive Block (Main architecture unit), Generalized Autoregressive Unit (Composable sub-unit), Ladder-of-Scales (Multi-scale verification), and Unit-based Generation (Stepwise code generation).
The system simulates the research process from ideation to verification using LLM agents and a genetic programming backbone operating on a factorized design space.
Genesys employs a Ladder-of-Scales approach for efficient verification and unit-based code generation for improved design quality and efficiency.

PSALM-V: Automating Symbolic Planning in Interactive Visual Environments with Large Language Models

PSALM-V: introduces a neuro-symbolic learning system that induces symbolic action semantics in visual environments by iteratively initializing/updating problem files, sampling trajectories, executing in the environment, predicting errors, generating/updating action semantics, and using a symbolic planner for verification.
The system maintains a tree-structured belief over action semantics, refining it based on execution outcomes and predicted errors to enable reliable planning without expert definitions.
PSALM-V dynamically infers PDDL problem files and domain action semantics by analyzing execution outcomes and synthesizing possible error explanations.

24th June 2025

Learning Instruction-Following Policies through Open-Ended Instruction Relabeling with Large Language Models

OIR (Open-Ended Instruction Relabeling): introduces a framework that leverages a Large Language Model to automatically generate open-ended instructions from collected agent trajectories, enriching training data for instruction-following reinforcement learning.
The framework uses the LLM to relabel unsuccessful trajectories by identifying accomplished subtasks, providing semantic rewards for efficient learning in sparse environments.
A prioritized instruction buffer manages the diverse, LLM-generated instructions, balancing exploration and exploitation for robust policy improvement.

QHackBench: Benchmarking Large Language Models for Quantum Code Generation Using PennyLane Hackathon Challenges

QHackBench: introduces a novel benchmark dataset and evaluation framework for LLM-based quantum code generation, featuring QHack Challenges, PennyLang Dataset, Retrieval, Code Generation Agent, Test Bench, Validation & Correction Agent, Self-Reasoning, and Augmented Query components.
The framework systematically evaluates LLMs using vanilla prompting, Retrieval-Augmented Generation, and a multi-agent iterative refinement pipeline on real-world quantum coding challenges.
Results indicate RAG and multi-agent approaches can enhance performance, highlighting the importance of domain-specific context and iterative debugging for reliable quantum code generation.

Prover Agent: An Agent-based Framework for Formal Mathematical Proofs

Prover Agent: introduces an agent-based framework for formal mathematical proofs, coordinating an Informal LLM (informal reasoning), Prover Model (formal proving), Lean (formal verification), and AutoFormalizer (formalizes lemmas).
The framework generates auxiliary lemmas via informal reasoning, formalizes them, proves them, and uses verified lemmas to synthesize the final proof.
Iterative refinement based on Lean feedback is used throughout the process to ensure correctness and improve proof construction.

JoyAgents-R1: Joint Evolution Dynamics for Versatile Multi-LLM Agents with Reinforcement Learning

JoyAgents-R1: introduces a joint evolution dynamics framework for heterogeneous multi-agent systems, including a master agent (orchestrates tasks), sub-agents (specialized task execution), agent memory (stores past information), tools (external functionalities), joint evolution dynamics (training process), and joint reward function (calculates action feedback).
The framework employs a hierarchical architecture where the master agent delegates tasks to specialized sub-agents that interact with tools and memory.
Joint evolution dynamics leverages GRPO with node-wise Monte Carlo sampling and marginal benefit updating, while memory evolves adaptively using GRPO rewards.

MAM: Modular Multi-Agent Framework for Multi-Modal Medical Diagnosis via Role-Specialized Collaboration

MAM (Modular Multi-Agent Framework for Multi-Modal Medical Diagnosis): introduces a modular, collaborative framework with General Practitioner (Initial triage/referral), Specialist Team (Domain expert agents), Radiologist (Image analysis agent), Medical Assistant (Information retrieval/summary), and Director (Orchestrator/synthesizer) agents for multi-modal medical diagnosis.
The framework decomposes the diagnostic process into specialized roles, each embodied by an LLM-based agent, enabling efficient knowledge updates and leveraging existing models.
Agents collaborate through a defined workflow involving initial triage, problem decomposition, information retrieval, diagnostic opinion generation, discussion, report synthesis, consensus, and final diagnosis derivation.

LLM-Based Social Simulations Require a Boundary

LLM-Based Social Simulations: introduces boundaries for reliable social science contributions, focusing on LLM Agents (model individual behavior), Alignment (simulated behaviors match real-world), Consistency (coherent agent behavior over time), and Robustness (reproducibility under conditions).
The paper argues that LLMs' inherent limitations, particularly lack of behavioral heterogeneity, constrain their reliability for simulating complex social dynamics.
It proposes heuristic boundaries and a checklist to guide researchers in determining the appropriate scope and claims for such simulations in social science research.

SAGE: Strategy-Adaptive Generation Engine for Query Rewriting

SAGE (Strategy-Adaptive Generation Engine): introduces a reinforcement learning framework for query rewriting that integrates a Policy Model guided by Explicit Strategic Primitives, evaluated by an Environment, and trained using a Reward Shaping Module with Strategic Credit Shaping and Contrastive Reward Shaping, enhanc

Name		Name	Last commit message	Last commit date
Latest commit History 1,358 Commits
resources		resources
LICENSE		LICENSE
README.md		README.md

License

tmgthb/Autonomous-Agents

Folders and files

Latest commit

History

Repository files navigation

Autonomous Agents

Research papers: 2025

1st August 2025

31st July 2025

30th July 2025

29th July 2025

28th July 2025

27th July 2025

26th July 2025

25th July 2025

24th July 2025

23rd July 2025

22nd July 2025

21st July 2025

20th July 2025

19th July 2025

18th July 2025

17th July 2025

16th July 2025

15th July 2025

14th July 2025

13th July 2025

12th July 2025

11th July 2025

10th July 2025

9th July 2025

8th July 2025

7th July 2025

6th July 2025

5th July 2025

4th July 2025

3rd July 2025

2nd July 2025

1st July 2025

30th June 2025

29th June 2025

28th June 2025

27th June 2025

26th June 2025

25th June 2025

24th June 2025

About

Topics

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors 2

Uh oh!

Packages