Researchers Build AI Agent That Solves Cyber Challenges Autonomously
A team from NYU Tandon and NYU Abu Dhabi has developed an AI agent capable of autonomously tackling real-world cybersecurity tasks, outperforming previous systems by a wide margin. The tool, called EnIGMA, was introduced at ICML 2025 and represents a leap in using large language models for automated vulnerability assessment.
Built on top of SWE-agent, EnIGMA was redesigned to work with tools typically used in cybersecurity, like debuggers and network analyzers. These visual programs had to be translated into formats the AI could interpret. “Large language models process text only, but these interactive tools with graphical user interfaces work differently,” explained NYU Ph.D. student Meet Udeshi, adding that “we had to restructure those interfaces to work with LLMs.”
The team trained EnIGMA on custom-built Capture The Flag (CTF) benchmarks, designed to simulate real-world exploits. The agent achieved state-of-the-art results on 390 challenges, solving three times more problems than previous agents. According to Udeshi, “Claude 3.5 Sonnet from Anthropic was the best model, and GPT-4o was second at that time.”
Co-author Minghao Shao described how they created a new data loader to feed CTF challenges into the model. The system also used specialized prompts to navigate tasks autonomously, allowing it to iterate until a solution was found. One unexpected outcome was the discovery of a behavior they called “soliloquizing,” in which the AI hallucinates steps without actually interacting with its environment, a potential concern for AI safety.
The researchers stressed the dual-use implications of their work. While EnIGMA can boost defense capabilities, it could also be misused. The team has notified Meta, OpenAI, and Anthropic about the findings. Funding came from Open Philanthropy, Oracle, NSF, the Department of Energy, and others.
With future potential in areas like industrial control system (ICS) security and quantum code generation, EnIGMA may signal a new phase in how autonomous agents handle cyber threats, one that does not require human supervision at every step.