Skip to content

Tool Use and Computer Use

📋Page Status
Quality:88 (Comprehensive)
Importance:82.5 (High)
Last edited:2025-12-28 (10 days ago)
Words:3.3k
Backlinks:1
Structure:
📊 6📈 1🔗 28📚 914%Score: 14/15
LLM Summary:Tool use capabilities evolved from basic API calls to superhuman computer control (OSAgent 76.26% vs 72% human baseline on OSWorld) by late 2025, with SWE-bench reaching 80.9%. However, OpenAI acknowledged prompt injection 'may never be fully solved,' with 94.4% of agents vulnerable and 100% of multi-agent systems exploitable through inter-agent attacks.
Capability

Tool Use and Computer Use

Importance82
Safety RelevanceVery High
Key ExamplesClaude Computer Use, GPT Actions
Related
Organizations

Tool use capabilities represent one of the most significant developments in AI systems, transforming language models from passive text generators into active agents capable of interacting with the external world. These capabilities span from simple API calls to sophisticated computer control, enabling AI systems to execute code, browse the web, manipulate files, and even operate desktop applications through mouse and keyboard control. The progression from Claude’s computer use beta in October 2024 to increasingly sophisticated implementations across major AI labs demonstrates the rapid advancement of this critical capability area.

This evolution matters because it fundamentally changes the nature of AI systems from advisory tools that can only provide text-based recommendations to autonomous agents capable of taking concrete actions in digital environments. The implications extend far beyond enhanced functionality—tool use capabilities create new attack surfaces, complicate safety monitoring, and enable both beneficial applications like automated research assistance and concerning uses like autonomous cyber operations. As these systems become more sophisticated, understanding their capabilities, limitations, and safety implications becomes crucial for responsible deployment and governance.

The trajectory toward more capable tool-using agents appears inevitable, with major AI labs investing heavily in this area. However, the dual-use nature of these capabilities—where the same functionality that enables beneficial automation also enables potential harm—presents unique challenges for safety research and policy development that distinguish tool use from other AI capability advances.

DimensionCurrent State (Late 2025)TrendSafety Relevance
Function CallingMature — BFCL benchmark shows Claude Opus 4.1 at 70.36%, standardized via MCPStableModerate — Well-defined interfaces enable monitoring
Web BrowsingAdvanced — ChatGPT agent/Operator integrated into main productsContinued improvementHigh — OpenAI admits prompt injection “may never be fully solved”
Code ExecutionStrong — SWE-bench Verified reaches 80.9% (Claude Opus 4.5)Rapid improvementHigh — Potential for malware, system manipulation
Computer UseSuperhuman — OSAgent 76.26% vs. 72% human baseline on OSWorldMilestone achievedVery High — Universal interface bypasses API restrictions
Multi-Agent OrchestrationAdvancing — MCP standardization enables cross-platform coordinationRapid developmentCritical — 100% of multi-agent systems vulnerable to inter-agent exploits

Technical Foundations and Current Implementations

Section titled “Technical Foundations and Current Implementations”

Modern tool use systems typically employ a structured approach where AI models receive descriptions of available tools, generate properly formatted function calls, execute these calls in controlled environments, and process the results to continue task completion. This architecture has been implemented with varying degrees of sophistication across major AI systems. OpenAI’s function calling, introduced in June 2023, established early patterns for structured API invocation with JSON schema validation and support for parallel tool execution. Google’s Gemini Extensions focused on deep integration with Google’s ecosystem, enabling cross-service workflows between Gmail, Calendar, and Drive.

Anthropic’s Computer Use capability, launched in public beta in October 2024, represents a significant advancement by enabling direct desktop interaction. The system can take screenshots, interpret visual interfaces, move the mouse cursor, and provide keyboard input to control any application a human could operate. This universal interface approach eliminates the need for custom API integrations, though it currently operates more slowly than human users and struggles with complex visual interfaces or applications requiring rapid real-time interaction.

The underlying technical implementation relies heavily on vision-language models that can interpret screenshots and translate high-level instructions into specific UI interactions. Training these systems involves a combination of supervised fine-tuning on human demonstrations, reinforcement learning from successful task completion, and synthetic data generation. The challenge lies in teaching models both the mechanical aspects of tool operation (correct function call formatting, proper argument passing) and the strategic aspects (when to use which tools, how to recover from errors, how to chain tools effectively).

Current limitations include frequent tool selection errors, brittle error recovery mechanisms, and difficulty with novel tools not seen during training. Most implementations require careful prompt engineering and work best with familiar, well-documented tools rather than adapting flexibly to new interfaces or APIs.

Performance on tool use benchmarks reveals both rapid progress and, in some cases, superhuman achievement:

BenchmarkTask TypeHuman BaselineBest AI (2024)Best AI (Late 2025)Key Insight
OSWorldComputer control72.4%14.9% (Claude 3.5)76.26% (OSAgent)Superhuman achieved Oct 2025
SWE-bench VerifiedCode issue resolution~92%49.0% (Claude 3.5)80.9% (Claude Opus 4.5)Near-human; 18x improvement since 2023
SWE-bench ProPrivate codebases22.7% (Claude Opus 4.1)More realistic; drops to 17.8% on commercial code
GAIAGeneral assistant tasks92%15% (GPT-4)75% (h2oGPTe)First “C grade” achieved in 2025
BFCLFunction calling70.36% (Claude Opus 4.1)GPT-5 at 59.22%; Chinese models competitive
WebArenaWeb navigation~35%~55%Realistic web task completion

The trajectory of improvement has been extraordinary. On OSWorld, the best AI agent went from 14.9% in October 2024 to 76.26% in October 2025—a 5x improvement that crossed the human baseline. Claude’s OSWorld performance improved 45% in just four months (from 42.2% to 61.4% with Sonnet 4.5). The dramatic improvement in SWE-bench scores—from 4.4% in 2023 to 80.9% by late 2025—illustrates how rapidly agentic coding capabilities are advancing.

However, performance drops significantly on private codebases that models haven’t seen during training. On SWE-bench Pro’s commercial code subset, Claude Opus 4.1 drops from 22.7% to 17.8%, and GPT-5 falls from 23.1% to 14.9%. This suggests current high scores may partially reflect training data contamination rather than genuine generalization.

Tool use capabilities introduce qualitatively different safety challenges compared to text-only AI systems. The fundamental shift from advisory outputs to autonomous action creates persistent consequences that extend beyond the AI system itself. When a language model generates harmful text, the damage remains contained to that output; when a tool-using agent executes malicious code or manipulates external systems, the effects can propagate across networks and persist indefinitely.

The expanded attack surface represents a critical concern. Each tool integration introduces potential vulnerabilities, from SQL injection through database APIs to privilege escalation through system command execution. Research by anthropic and other labs has demonstrated that current jailbreak techniques can be adapted to tool use contexts, where seemingly benign tool calls can be chained together to achieve harmful objectives. For example, a model might use legitimate web browsing tools to gather information for social engineering attacks, or combine file system access with network tools to exfiltrate sensitive data.

Monitoring and oversight become significantly more complex with tool-using agents. Traditional safety measures designed for text outputs—such as content filtering or human review of responses—prove inadequate when models can take rapid sequences of actions through external interfaces. The combinatorial explosion of possible tool interactions makes it difficult to anticipate all potential misuse patterns, and the speed of automated tool execution can outpace human oversight capabilities.

The challenge of maintaining meaningful human control becomes acute when agents can operate autonomously across multiple tools and time horizons. Current approaches like requiring human approval for specific actions face the fundamental tension between preserving utility (which requires minimizing friction) and maintaining safety (which requires meaningful oversight). As tool use becomes more sophisticated, this tension will likely intensify.

Research on AI agent security has revealed alarming vulnerability rates. According to a comprehensive study on agent security, the attack surface for tool-using agents is significantly larger than for text-only systems. According to OWASP’s 2025 Top 10 for LLM Applications, prompt injection ranks as the #1 critical vulnerability, appearing in over 73% of production AI deployments assessed during security audits.

Vulnerability TypePrevalenceSeverityExample Attack
Prompt Injection94.4% of agents vulnerable; OWASP #1 threatCriticalMalicious instructions hidden in web content
Retrieval-Based Backdoors83.3% vulnerableHighPoisoned documents trigger unintended behavior
Inter-Agent Trust Exploits100% vulnerableCriticalCompromised agent manipulates others in multi-agent systems
Memory PoisoningCommonHighGradual alteration of agent behavior through corrupted context
Excessive AgencyCommonHighOver-permissioned agents cause unintended damage

In December 2025, OpenAI made a significant admission: prompt injection “may never be fully solved.” In their technical blog on hardening ChatGPT Atlas, they stated that “prompt injection, much like scams and social engineering on the web, is unlikely to ever be fully ‘solved.’” The UK’s National Cyber Security Centre issued a similar warning that prompt injection attacks against generative AI applications “may never be totally mitigated.”

Real-world incidents have demonstrated these risks. The EchoLeak exploit (CVE-2025-32711) against Microsoft Copilot showed how infected email messages containing engineered prompts could trigger automatic data exfiltration without user interaction. Experiments with OpenAI’s Operator demonstrated how agents could harvest personal data and automate credential stuffing attacks. Brave’s security research on the Perplexity Comet vulnerability confirmed that indirect prompt injection is “not an isolated issue, but a systemic challenge facing the entire category of AI-powered browsers.”

OpenAI developed an “LLM-based automated attacker”—a bot trained using reinforcement learning to discover prompt injection vulnerabilities. Unlike traditional red-teaming, this system can “steer an agent into executing sophisticated, long-horizon harmful workflows that unfold over tens (or even hundreds) of steps.” In one demonstration, the automated attacker inserted a malicious email into a test inbox; when the AI agent scanned emails, it followed hidden instructions and sent a resignation message instead of the intended out-of-office reply.

Loading diagram...

McKinsey’s agentic AI security playbook emphasizes that organizations should enforce strong sandboxing with network restrictions, implement tamper-resistant logging of all agent actions, and maintain traceability mechanisms from the outset.

Computer use capabilities deserve special attention because they represent a universal interface that can potentially access any digital functionality available to human users. Unlike API-specific tool integrations that require custom development for each service, computer control enables AI agents to operate any software through the same visual interface humans use. This universality creates both tremendous potential and significant risks.

In October 2025, AI agents crossed the human baseline on computer control for the first time. AGI Inc.’s OSAgent achieved 76.26% on OSWorld, exceeding the approximately 72% human baseline. The agent uses continuous self-checking (the “verification-generation gap”): it verifies outcomes in real time and corrects on the next turn when a step fails. Training combines a general-reasoning base model with hundreds of thousands of synthetic tasks and real browser environments.

Model/SystemOSWorld ScoreDateKey Technique
Claude 3.5 (baseline)14.9%Oct 2024Vision-language + screenshot analysis
Claude 3.728.0%Feb 2025Improved planning and error recovery
Agent S2 + Claude 3.734.5%Mar 2025Specialized agentic scaffolding
Claude Sonnet 442.2%July 2025Enhanced tool use training
Claude Sonnet 4.561.4%Nov 202545% improvement in 4 months
Claude Opus 4.566.3%Nov 2025Extended autonomous operation
Agent S3 (Best-of-N)69.9%Oct 2025Behavior Best-of-N techniques
OSAgent76.26%Oct 2025Self-verification, synthetic data
Human baseline~72%

However, efficiency remains a significant limitation. Even high-performing agents take 1.4-2.7x more steps than necessary to complete tasks. What humans can accomplish in 30 seconds might take an agent 12 minutes—primarily because 75-94% of the time is spent on planning and reflection calls to large AI models rather than actual task execution.

Claude’s updated computer use tool (January 2025) added new capabilities including hold_key, left_mouse_down, left_mouse_up, scroll, triple_click, and wait commands, plus a zoom feature for viewing specific screen regions at full resolution. These granular controls enable more precise UI interactions that were previously unreliable.

The implications of reliable computer use extend across virtually every domain of human digital activity. Positive applications include accessibility tools for users with disabilities, automated testing and quality assurance, and research assistance that can navigate complex information systems. Concerning applications include automated social engineering attacks, mass surveillance through social media manipulation, and autonomous malware that can adapt to novel security measures.

Tool Integration Standards: Model Context Protocol

Section titled “Tool Integration Standards: Model Context Protocol”

The Model Context Protocol (MCP), announced by Anthropic in November 2024, represents a significant step toward standardizing AI-tool integration. MCP addresses what engineers called the “M×N problem”—the combinatorial explosion of connecting M different AI models with N different tools or data sources. By providing a universal protocol, developers implement MCP once and unlock an entire ecosystem of integrations.

AspectDetails
ArchitectureJSON-RPC 2.0 transport, similar to Language Server Protocol (LSP)
PrimitivesServers: Prompts, Resources, Tools; Clients: Roots, Sampling
SDKsPython, TypeScript, C#, Java (97M+ monthly SDK downloads)
Pre-built Servers10,000+ published servers (Google Drive, Slack, GitHub, Git, Postgres, Puppeteer)
AdoptionClaude, ChatGPT, Gemini, Cursor, VS Code, Microsoft Copilot
GovernanceDonated to Linux Foundation’s Agentic AI Foundation (AAIF) Dec 2025
Co-foundersAnthropic, Block, OpenAI (with support from Google, Microsoft, AWS, Cloudflare)

In December 2025, MCP became a founding project of the newly created Agentic AI Foundation (AAIF), a directed fund under the Linux Foundation. The donation ensures MCP “stays open, neutral, and community-driven as it becomes critical infrastructure for AI.” OpenAI officially adopted MCP in March 2025, integrating the standard across its products including the ChatGPT desktop app.

The rapid uptake of MCP—with over 10,000 published servers and 97M+ monthly SDK downloads—suggests growing consensus around standardized tool integration. This standardization has dual implications for safety: it enables more consistent monitoring and security practices, but also accelerates the proliferation of tool-using capabilities across the AI ecosystem. Bloomberg noted that MCP provides “the essential connective layer required” for agentic AI systems that “do far more than simple question-answering.”

As of late 2025, tool use capabilities have reached several significant milestones. On OSWorld, AI agents now achieve superhuman performance (76.26% vs. 72% human baseline). Claude Opus 4.5 achieved 80.9% on SWE-bench Verified and demonstrated the ability to work autonomously for 30+ hours while maintaining focus on complex multi-step tasks. In one demonstration, Claude Sonnet 4.5 autonomously rebuilt Claude.ai’s web application over approximately 5.5 hours with 3,000+ tool uses.

Despite accuracy improvements on benchmarks, efficiency remains a significant limitation. Research on OSWorld-Human reveals that even high-performing agents take 1.4-2.7x more steps than humans to complete tasks. What humans can accomplish in 30 seconds might take an agent 12 minutes—primarily because 75-94% of the time is spent on planning and reflection calls to large AI models.

Safety research has not kept pace with capability development. OpenAI’s December 2025 admission that prompt injection “may never be fully solved” represents a significant acknowledgment. While defensive approaches are advancing—including Google DeepMind’s CaMel framework (which treats LLMs as untrusted elements) and Microsoft’s FIDES (using information-flow control)—no production-ready solution exists for the fundamental vulnerability.

The economic incentives for tool use development remain exceptionally strong. OpenAI’s GPT-5 leads MCPMark performance at approximately $127.46 per benchmark run, compared to Claude Sonnet 4 at $152.41. Organizations recognize the potential for significant productivity gains through automated digital workflows, creating pressure for rapid deployment even before safety questions are fully resolved.

Several critical uncertainties will shape the development of tool-using AI systems over the coming years. The scalability of current training approaches remains unclear—while supervised fine-tuning and reinforcement learning have produced impressive demonstrations, it’s uncertain whether these methods can reliably teach agents to use arbitrary new tools or adapt to changing interfaces without extensive retraining.

The fundamental question of AI control in tool use contexts presents perhaps the most significant uncertainty. Current approaches to AI safety were developed primarily for language models that could only provide advice; extending these techniques to autonomous agents presents novel challenges that may require entirely new safety paradigms. The effectiveness of proposed solutions like constitutional AI, interpretability research, and formal verification methods for tool-using agents remains largely untested.

The interaction between tool use capabilities and other AI advances creates additional uncertainty. As models become more capable of long-term planning, steganography, and deception, the risks associated with tool use may increase non-linearly. Conversely, advances in AI safety research may provide new tools for monitoring and controlling autonomous agents.

Economic and regulatory responses will significantly influence the development trajectory. Industry self-regulation, government oversight, and international coordination efforts could substantially alter the pace and direction of tool use development. However, the dual-use nature of these capabilities makes targeted regulation challenging without hampering beneficial applications.

The technical question of whether safe, beneficial tool use is possible at scale remains open. While current systems demonstrate both impressive capabilities and significant safety challenges, it’s unclear whether fundamental barriers exist to creating reliable, beneficial tool-using agents or whether current problems represent engineering challenges that will be resolved through continued research and development.

DateEventSignificance
June 2023OpenAI introduces function callingEstablishes structured API invocation pattern for LLMs
Nov 2023GAIA benchmark releasedFirst comprehensive test for general AI assistants with tool use
Apr 2024OSWorld benchmark published (NeurIPS 2024)Standardized evaluation for computer control agents
Aug 2024SWE-bench Verified releasedHuman-validated coding benchmark; collaboration with OpenAI
Oct 2024Anthropic launches Computer Use betaFirst frontier model with direct desktop control
Nov 2024Model Context Protocol announcedOpen standard for AI-tool integration
Dec 2024Claude 3.5 Sonnet achieves 49% on SWE-benchSignificant jump in agentic coding capability
Jan 2025OpenAI launches OperatorBrowser-based agentic AI with Computer-Using Agent (CUA) model
Feb 2025Claude 3.7 reaches 28% on OSWorldTop leaderboard position at release
Mar 2025OpenAI officially adopts MCPIntegration across ChatGPT desktop app
Apr 2025Google DeepMind introduces CaMel frameworkTreats LLMs as untrusted elements for security
July 2025ChatGPT agent mode launchedOperator integrated into main ChatGPT product
July 2025OSWorld-Verified releasedMajor benchmark updates, AWS parallelization support
Oct 2025OSAgent achieves 76.26% on OSWorldFirst superhuman performance on computer control benchmark
Nov 2025Claude Opus 4.5 released80.9% on SWE-bench, 66.3% on OSWorld; 30+ hour autonomous operation
Dec 2025Scale AI releases SWE-bench ProHarder benchmark with private/commercial codebases
Dec 2025MCP donated to Linux Foundation AAIFIndustry standardization; co-founded by Anthropic, Block, OpenAI
Dec 2025OpenAI admits prompt injection “may never be fully solved”Critical security acknowledgment for agentic AI