The ability of AI models to perform end-to-end, multi-stage penetration tests that match the capabilities of humans undertaking the same tasks has improved dramatically in recent months, according to new benchmarks published by the UK government’s AI Security Institute (AISI). In November 2025, the difficulty of cyber tasks the best models could complete was doubling…
Tag: perform
AI, Compliance, Global Security News
AI is ready to take over Python programming, but not much else
Tests of how well 19 large language models (LLMs) perform complicated multi-step tasks have shown that they are both error-prone and, in many cases, unreliable. The findings are contained in a preprint paper, LLMs Corrupt Your Documents When You Delegate, written by Microsoft researchers Philippe Laban, Tobias Schnabel and Jennifer Neville based on a…
AI, Cybersecurity, Exploits, Global Security News, Government & Policy
Flaw in Claude’s Chrome extension allowed ‘any’ other plugin to hijack victims’ AI
As businesses and governments turn to AI agents to access the internet and perform higher-level tasks, researchers continue to find serious flaws in large language models that can be exploited by bad actors. The latest discovery comes from browser security firm LayerX, involving a bug in the Chrome extension for Anthropic’s Claude AI model that…
AI, Global Security News
Perplexity’s new Computer agent will run other agents for you
Perplexity says its new Perplexity Computer service can perform complex, multi-step tasks on behalf of human users, by organizing the tasks that are needed and creating the software agents required to fulfill the process. Users begin by describing their desired outcome, the company said, then, “Perplexity Computer breaks it into tasks and subtasks, creating sub-agents…
AI, Cybersecurity, Global Security News
AI agents still need humans to teach them
AI agents need skills (specific procedural knowledge) to perform tasks well, but they can’t teach themselves, new research suggests. The authors of the research have developed a new benchmark, SkillsBench, which evaluates agentic AI performance on 84 tasks across 11 domains including healthcare, manufacturing, cybersecurity and software engineering. The researchers looked at…
Global Security News, malware
Android Malware Hijacks Google Gemini to Stay Hidden
A new Android malware implant that uses Google Gemini to perform persistence tasks was discovered on VirusTotal and analyzed by ESET.
agentic ai, AI, Global Security News, News, openai, oracle
OpenAI Frontier organizes AI agents under one system
OpenAI introduced Frontier, a platform designed to organize AI agents that perform business tasks within internal systems and workflows. The platform connects data from multiple internal systems including customer relationship management tools, ticketing platforms, and data warehouses. This integration creates a shared knowledge layer that allows AI agents to understand business processes and decision points…
