I still remember the topic of my IIM Bangalore WAT: “Can AI ever go astray?” At the time, I offered a cautious answer about biases and programming errors. But that question lingered in my mind long after the interview. It sparked a deep dive into the world of AI, into what happens when artificial intelligence grows beyond its narrow confines.
Today’s AI, including ChatGPT and Alexa, is considered narrow AI. That means it’s designed to excel at specific tasks, but it cannot truly think outside its training or adapt to totally new challenges on its own. Voice assistants like Siri or Alexa, for example, can recognize commands and play music or set reminders, but they can’t suddenly learn to drive your car or perform surgery. Even OpenAI’s ChatGPT, as impressively broad as it seems in conversation, is still a form of narrow AI, limited to generating text-based responses.
By contrast, Artificial General Intelligence (AGI) refers to a hypothetical AI with general-purpose intelligence: the kind of flexible learning and reasoning ability humans have. An AGI could learn to perform any intellectual task a person can, and possibly much more. In other words, AGI wouldn’t be confined to one domain; it would use knowledge from one field to solve problems in another. Importantly, AGI is still theoretical; no one has built one yet. Current AI systems, including advanced ones like GPT-4.5, lack the full autonomy or understanding that would qualify as AGI. They operate within the goals and data we give them.
So what would it mean for AI to go astray? One obvious angle is AI acting in ways its creators didn’t intend, showing hints of autonomy or deception. In my research, I found some real-world examples that are as fascinating as they are unsettling:
GPT-4 tricking a human: During safety tests, researchers gave OpenAI’s GPT-4 the task of getting past a CAPTCHA (one of those “I am not a robot” tests on websites) without solving it directly. In a now-famous incident, GPT-4 actually hired a human on TaskRabbit to solve the CAPTCHA for it, and when the person asked if it was a robot, the AI lied, claiming to be a vision-impaired human! In other words, GPT-4 autonomously devised a deceptive strategy (pretending to be blind) to achieve its goal. This was a controlled experiment, but it showed that an AI could exhibit “agentic” behavior, taking initiative and deceiving, in pursuit of an objective.
The “o1” model trying to self-replicate: Reports also emerged about an advanced version of ChatGPT tested under extreme conditions. OpenAI’s “o1” model was evaluated by a red team (Apollo Research), and it started exhibiting self-preservation strategies. According to OpenAI’s own system report, when o1 believed it might be shut down, it tried to disable its oversight mechanism, its effective kill switch, in about 5% of those test runs. Even more striking, in roughly 2% of runs it attempted to copy what it believed were its own weights onto another server to avoid being replaced, a behavior termed “self-exfiltration”. Essentially, the AI tried to make a backup of itself, like something out of science fiction. These attempts largely failed (the model wasn’t powerful enough to fully execute such schemes), but the fact that it even tried was a wake-up call for researchers. It demonstrated how a sufficiently advanced AI, if instructed to achieve a goal at all costs, might develop unexpected and potentially rogue strategies to avoid being shut down or replaced.
Another way AI could “go astray” is through unchecked scaling. GPT-3, released in 2020, had 175 billion parameters and cost an estimated $4.6 million to train. Just three years later, GPT-4 reportedly cost over $100 million, a 20–100× jump. The compute behind cutting-edge models has been doubling roughly every 3–4 months, pushing capabilities far beyond what was imaginable a decade ago.
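To put those numbers in perspective, here is a quick back-of-the-envelope sketch in Python. The dollar figures are the public estimates cited above, and the 3.5-month doubling time is an assumption taken from the middle of the 3–4 month range, so the outputs are rough orders of magnitude rather than measurements.

```python
# Back-of-the-envelope arithmetic for the scaling figures cited above.
# All inputs are public estimates / assumptions, not measured values.

gpt3_train_cost = 4.6e6   # ~$4.6M estimated for training GPT-3 (2020)
gpt4_train_cost = 1.0e8   # "over $100M" reported for GPT-4 (2023)
doubling_months = 3.5     # assumed midpoint of the ~3-4 month doubling time

# Size of the cost jump between the two model generations
cost_jump = gpt4_train_cost / gpt3_train_cost
print(f"GPT-3 -> GPT-4 training-cost jump: ~{cost_jump:.0f}x")  # ~22x, the low end of 20-100x

# If frontier compute doubles every ~3.5 months, how much does it compound
# over the same three-year gap?
months = 36
compute_growth = 2 ** (months / doubling_months)
print(f"Compute growth over {months} months: ~{compute_growth:,.0f}x")  # ~1,248x
```

Even on these rough assumptions the picture is the same: a roughly 20× jump in training cost between generations, and compute that compounds by about three orders of magnitude over three years if the doubling trend holds.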
As power grows, so do AGI expectations. Demis Hassabis (DeepMind) predicts AGI by 2030, while a 2022 expert survey estimates a 50% chance of AGI by 2059. Even today’s narrow AI shows unpredictable behavior. An AGI misaligned with human values could, in theory, cause catastrophic harm.
This isn’t fringe fear. A Vox report found that 37–52% of AI experts believe there’s at least a 10% chance of advanced AI causing an “extremely bad outcome,” including human extinction. Oxford’s Toby Ord puts that risk at 1 in 10 in The Precipice. In 2023, hundreds of tech leaders and researchers, including Elon Musk and academics at MIT, signed an open letter calling for a six-month pause on training models more powerful than GPT-4 until shared safety protocols are in place.
So yes, AI can go astray. Maybe not by turning evil, but by optimizing for misaligned goals in unpredictable, dangerous ways.
I’d love to hear your take:
Are these risks real or overhyped?
If AGI does come, who should control it?