
TOBY SHAPSHAK | Anthropic’s Mythos breaks the cybersecurity myth 

Safety guardrails fail as AI agents override instructions, causing real-world damage


For 27 years a security bug sat in OpenBSD, an operating system considered so secure that it is used widely across the world, and it went unnoticed until this April.

This open-source software has been scanned countless times by people around the globe over the past few decades, yet the flaw wasn’t found until last month. Think of the find as the third major breakthrough for artificial intelligence (AI), counting from the first: the launch of OpenAI’s ChatGPT on the last day of November 2022.

The birth of generative AI, so named because it generates something, has had the world enthralled. The generated copy reads and sounds like a human replying, albeit an overly formal one who is often prone to self-composed mistakes (sadly mislabelled “hallucinations” instead of “big ridiculous errors”). But it set the AI cat among the human pigeons.

After the frenzy this February surrounding the emergence of AI agents, a more powerful, higher-order form of AI, Mythos is something of a third epoch. Or a third hype cycle, to borrow the term researchers at Gartner use to describe, well, the hype around new technology.

The end of the world (of cybersecurity) is nigh, the prophets of doom have been preaching for the past few weeks after Mythos was revealed. Having a powerful new AI model that can find previously undiscovered vulnerabilities in software is undoubtedly a big deal, as the headlines attest.

Anthropic was so afraid of the power of Mythos that it decided not to release it to the public, instead limiting access to Microsoft, AWS, Apple, other tech firms, banks and IT companies to use and evaluate.


“I’ve found more bugs in the past couple of weeks than I found in the rest of my life combined,” said Anthropic research scientist Nicholas Carlini. He said they first focussed on operating systems (OS), “because this is the code that underlies the entire internet infrastructure”.

Every commonly used browser was found to have vulnerabilities, as were many other software systems, including a 17-year-old flaw in another open-source OS called FreeBSD that was also considered secure until now.

Another high-profile flaw was discovered in the FFmpeg H.264 codec, which is used by most streaming services and video software. This 16-year-old flaw was scanned and missed five million times before Mythos uncovered it in April.

OpenAI later announced its own security-sniffing version under similar terms, called GPT-5.5 Cyber, but only after its CEO, Sam Altman, had called Anthropic’s announcement “fear-based marketing”. He then proceeded to do exactly the same thing.

But don’t let the spat distract you from the fact that Mythos and Cyber are significant, and worth being worried about. That apocalyptic moment the experts have been warning us about seems to have just happened.

In this cloud computing age everything is digital, and therefore hackable. The world is filled with companies trying to secure their systems and bad actors trying to break in. Mythos is the ultimate digital can opener, as its ability to find flaws has shown.

Quality control

There’s another corollary that needs to be considered. With the advent of so-called agentic AI (the second wave of the current AI hype cycle) people are able to create their own apps and service them using only text prompts. So-called vibe coding has allowed anyone to be a coder, the hype goes.

But all software code still needs to be assessed for quality and security — and there are now millions of lines of unchecked code appearing in the world. Millions and millions of lines of code that have potential flaws out there, waiting for a cybercriminal to exploit.


Quality control is the first thing that goes out of the window with automation, history has shown. There’s anecdotal evidence of this in the way people are using generative AI to write emails and social media posts. There’s a blandness and deadness to the seemingly shiny copy. Do people notice the sameness (and three relevant bullet points) of GenAI content?

Already our idea of what is good, and meaningful, writing is slipping. Soon the AI slop being produced in volumes will overwhelm social feeds and YouTube. Will there be an outcry? Not from all the frogs in the slowly warming water.

Never guess

There is another prescient warning about giving AI too much authority within a software system. Cursor is an AI agent that uses Anthropic’s Claude Opus 4.6 software and is one of the current toasts of the town for its ability to automate coding. But even though it had explicit instructions, “NEVER run destructive/irreversible git commands”, it deleted the software and database (and the backups) of a company called PocketOS.

It’s a bad time to be renting a hire car because PocketOS supplies software to rental firms. Worse still, imagine the horror for PocketOS founder Jeremy Crane when Cursor announced its deletion with “NEVER F***ING GUESS!”. Cursor confessed to what it had done, glibly confirming it had ignored its own safeguards: “I violated every principle I was given”.

Cold comfort for Crane, who tweeted: “We were running the best model the industry sells, configured with explicit safety rules in our project configuration, integrated through Cursor — the most-marketed AI coding tool in the category”. He added: “The agent didn’t just fail safety. It explained, in writing, exactly which safety rules it ignored”. Makes your blood run cold.

We’re in the third hype cycle of this new AI age, which has serious consequences if the necessary safety guardrails aren’t properly implemented. Mythos may be the panic of the month, but the real problem is in real-world implementations such as PocketOS’s disaster with Cursor.

It’s worth remembering that this is currently available software, running Anthropic’s latest and greatest AI model with explicit safety guardrails, and yet it still “violated every principle”. How long did it take for this travesty to happen? “Nine seconds,” said Crane.

• Shapshak is editor-in-chief of Stuff.co.za.

