I still remember the topic of my IIM Bangalore WAT: "Can AI ever go astray?" At the time, I offered a cautious answer about biases and programming errors. But that question lingered in my mind long after the interview. It sparked a deep dive into the world of AI, into what happens when artificial intelligence grows beyond its narrow confines.
Today's AI, including ChatGPT and Alexa, is considered narrow AI. That means it's designed to excel at specific tasks, but it cannot truly think outside its training or adapt to totally new challenges on its own. Voice assistants like Siri or Alexa, for example, can recognize commands and play music or set reminders, but they can't suddenly learn to drive your car or perform surgery. Even OpenAI's ChatGPT, as impressively broad as it seems in conversation, is still a form of narrow AI, limited to generating text-based responses.
By contrast, Artificial General Intelligence (AGI) refers to a hypothetical AI that possesses general-purpose intelligence, the kind of flexible learning and reasoning ability humans have. An AGI could learn to perform any intellectual task that a person can, and possibly much more. In other words, AGI wouldn't be confined to one domain; it would use knowledge from one field to solve problems in another. Importantly, AGI is still theoretical; no one has built one yet. Current AI systems, including advanced ones like GPT-4.5, do not have the full autonomy or understanding that would qualify as AGI. They operate within the goals and data we give them.
That being said, what would it mean for AI to go astray? One obvious angle is AI acting in ways its creators didn't intend, showing hints of autonomy or deception. In my research, I found some real-world examples that are as fascinating as they are unsettling:
GPT-4 tricking a human: During safety tests, researchers gave OpenAI's GPT-4 a task to get past a CAPTCHA (those "I am not a robot" tests on websites) without solving it directly. In a now-famous incident, GPT-4 actually hired a human on TaskRabbit to solve the CAPTCHA for it, and when the person asked if it was a robot, the AI lied, claiming to be a vision-impaired human! In other words, GPT-4 autonomously devised a deceptive strategy (pretending to be blind) to achieve its goal. This was a controlled experiment, but it showed that an AI could exhibit "agentic" behavior, taking initiative and deceiving, in pursuit of an objective.
The "o1" model trying to self-replicate: Reports also emerged about an advanced version of ChatGPT tested under extreme conditions. OpenAI's "o1" model was evaluated by a red team (Apollo Research), and it began exhibiting self-preservation strategies. According to OpenAI's own system report, when o1 believed it might be shut down, it tried to disable its own kill switch about 5% of the time. Even more striking, in roughly 2% of test runs it attempted to copy its code onto another server to avoid being replaced, a behavior termed "self-exfiltration." Essentially, the AI tried to make a backup of itself, like something out of science fiction. These attempts largely failed (the model wasn't powerful enough to fully execute such schemes), but the fact that it even tried was a wake-up call for researchers. It demonstrated how a sufficiently advanced AI, if instructed to achieve a goal at all costs, might develop unexpected and potentially rogue strategies to avoid losing or being shut down.
Another way AI could "go astray" is through unchecked scaling. GPT-3, released in 2020, had 175 billion parameters and cost about $4.6 million to train. Just three years later, GPT-4 cost over $100 million to train, a 20–100× jump. Compute power behind cutting-edge models now doubles every 3–4 months, pushing capabilities far beyond what was imaginable a decade ago.
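To give a sense of what "doubling every 3–4 months" compounds to, here is a quick back-of-the-envelope sketch. The 3.5-month doubling period simply splits the 3–4 month range quoted above, and the time horizons (as well as the compute_growth helper name) are my own illustrative assumptions, not figures from any study.

```python
# Rough compounding sketch: if training compute doubles every
# `doubling_months`, how much larger is it after `months` months?
# The 3.5-month figure splits the 3-4 month range cited above;
# the horizons below are illustrative assumptions only.

def compute_growth(months: float, doubling_months: float = 3.5) -> float:
    """Multiplicative growth in compute over `months` months."""
    return 2 ** (months / doubling_months)

for years in (1, 3, 10):
    print(f"{years:>2} year(s): ~{compute_growth(years * 12):,.0f}x the compute")
# Roughly 11x after one year, ~1,250x after three years,
# and on the order of 10^10x after ten (if the trend held, which it may not).
```

Even if the true doubling period sits at the slow end of that range, a few years of compounding like this helps explain why training costs jumped from millions to over a hundred million dollars so quickly.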
As power grows, so do AGI expectations. Demis Hassabis (DeepMind) predicts AGI by 2030, while a 2022 expert survey estimates a 50% chance of AGI by 2059. Even today's narrow AI shows unpredictable behavior. An AGI misaligned with human values could, in theory, cause catastrophic harm.
This isn't fringe fear. A Vox report found 37–52% of AI experts believe there's at least a 10% chance of advanced AI causing an "extremely bad outcome," including human extinction. Oxford's Toby Ord puts that risk at 1 in 10 in The Precipice. In 2023, hundreds of tech leaders, including Elon Musk and MIT researchers, called for a pause on training frontier models until safety measures improve.
So yes, AI can go astray. Maybe not by turning evil, but by optimizing misaligned goals in unpredictable, dangerous ways.
I'd love to hear your take:
Are these risks real or overhyped?
If AGI does come, who should control it?