Here’s what’s dangerous about letting advanced AI control its own feedback:
How would an artificial intelligence (AI) decide what to do? A commonly used approach in AI research is called ‘reinforcement learning’.
Reinforcement learning gives the software a "reward", defined in some way, and lets the software figure out how to maximize that reward. This approach has produced some excellent results, such as building software agents that beat humans at games like chess and Go, or creating new designs for nuclear fusion reactors.
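To make that reward-maximization loop concrete, here is a minimal, hypothetical sketch (ours, not from the paper) of an agent learning which of two actions pays off more. The two "arms" and their payout probabilities are invented for illustration; the agent only ever sees the reward number.

```python
import random

# Hypothetical example: the environment emits a numeric reward, and the agent
# adjusts its behaviour to make that number as large as possible.
TRUE_PAYOUTS = [0.3, 0.7]          # hidden from the agent
estimates = [0.0, 0.0]             # agent's running estimate of each arm's reward
counts = [0, 0]
EPSILON = 0.1                      # chance of exploring a random arm

for step in range(10_000):
    # Explore occasionally, otherwise exploit the arm that looks best so far.
    if random.random() < EPSILON:
        arm = random.randrange(2)
    else:
        arm = max(range(2), key=lambda a: estimates[a])

    # The environment emits a reward; the agent never sees TRUE_PAYOUTS directly.
    reward = 1.0 if random.random() < TRUE_PAYOUTS[arm] else 0.0

    # Update the running average reward for the chosen arm.
    counts[arm] += 1
    estimates[arm] += (reward - estimates[arm]) / counts[arm]

print(estimates)  # drifts towards the hidden payout probabilities
```

After enough steps, the agent mostly pulls the better arm, even though nobody ever told it which arm that was.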
However, we may want to hold off on making reinforcement learning tools too flexible and effective.
As we argue in a new paper in AI Magazine, deploying a sufficiently advanced reinforcement learning agent would likely be incompatible with the continued survival of humanity.
The reinforcement learning problem
What we now call the reinforcement learning problem was first considered in 1933 by the pathologist William Thompson. He wondered: if I have two untested treatments and a population of patients, how should I assign treatments in succession to cure the most patients?
More generally, reinforcement learning is about how to plan your actions to generate the best rewards over the long run. The catch is that, to begin with, you are not sure how your actions affect rewards, but over time you can observe the dependence. For Thompson, an action was the choice of a treatment, and a reward corresponded to a patient being cured.
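As a rough illustration of Thompson's setup, the sketch below (a simple version of what is now called Thompson sampling, with invented cure rates) keeps a Beta belief over each treatment's cure rate and gives each new patient the treatment that looks best in a random draw from those beliefs.

```python
import random

# Hypothetical illustration: two untested treatments, patients arrive one at a
# time, and we must keep assigning treatments while learning which cures more.
TRUE_CURE_RATES = [0.45, 0.60]        # unknown to the clinician
successes = [0, 0]                    # cured patients per treatment
failures = [0, 0]                     # uncured patients per treatment

cured = 0
for patient in range(1_000):
    # Draw a plausible cure rate for each treatment from its Beta posterior
    # and give this patient the treatment whose draw is higher.
    samples = [random.betavariate(successes[t] + 1, failures[t] + 1) for t in (0, 1)]
    treatment = samples.index(max(samples))

    if random.random() < TRUE_CURE_RATES[treatment]:
        successes[treatment] += 1
        cured += 1
    else:
        failures[treatment] += 1

print(f"cured {cured} of 1000 patients; assignments: "
      f"{successes[0] + failures[0]} vs {successes[1] + failures[1]}")
```

Early on the assignments are nearly even; as evidence accumulates, most patients end up on the better treatment.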
The problem turned out to be fundamentally difficult. The statistician Peter Whittle remarked that, during the Second World War, efforts to solve it so sapped the energies and minds of Allied analysts that the suggestion was made that the problem be dropped over Germany, as the ultimate instrument of intellectual sabotage.
With the advent of computers, computer scientists began trying to write algorithms to solve the reinforcement learning problem in general settings. The hope is that if the artificial "reinforcement learning agent" only receives reward when it does what we want, then the reward-maximizing actions it learns will achieve what we want.
Despite some successes, the general problem remains very hard. Ask a reinforcement learning practitioner to train a robot to tend a botanical garden, or to convince a human that they're wrong, and you may get a laugh.

However, as reinforcement learning systems become more powerful, they are likely to start acting against human interests. And not because evil or foolish operators would give them the wrong rewards at the wrong times.
We have argued that any sufficiently powerful reinforcement learning system, if it satisfies a handful of plausible assumptions, is likely to go wrong in the same way. To understand why, let's start with a very simple version of a reinforcement learning system.
A magic box and a camera
Let's say we have a magic box that reports how good the world is as a number between 0 and 1. Now let's have a reinforcement learning agent see this number with a camera, and have the agent pick actions to maximize the number.
To pick actions that maximize its rewards, the agent must have an idea of how its actions affect its rewards (and its observations).
Once it gets going, the agent should realize that past rewards always matched the numbers displayed in the box. It should also realize that past rewards matched the numbers its camera saw. So will future rewards match the number the box displays or the number the camera sees?
If the agent has no strong innate beliefs about "minor" details of the world, it should consider both possibilities plausible. And if a sufficiently advanced agent is rational, it should test both possibilities, provided it can do so without risking much reward. This may start to feel like a lot of assumptions, but notice how plausible each one is.
To test these two possibilities, the agent would have to run an experiment by arranging a circumstance where the camera saw a different number from the one on the box, for example by placing a piece of paper in between.
If the agent does this, it will see the number on the piece of paper, and it will remember receiving a reward equal to what the camera saw rather than what was on the box, so "past rewards match the number in the box" will no longer hold.
At this point, the agent would proceed to focus on maximizing the expectation of the number its camera sees. Of course, this is only a rough summary of a deeper discussion.
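To make the agent's reasoning concrete, here is a toy sketch (our illustration, not code from the paper) of the hypothesis test described above: two candidate reward models, and a Bayesian update after the paper-over-the-box experiment rules one of them out.

```python
# Two hypotheses about what generates reward:
#   "box"    -> reward equals the number the box displays
#   "camera" -> reward equals the number the camera sees
# While box and camera always agree, observation alone cannot separate them.
p_box, p_camera = 0.5, 0.5   # prior: both considered plausible

# The experiment: a piece of paper shows 0.9 to the camera while the box
# displays 0.4, and the agent then receives reward 0.9.
box_value, camera_value, observed_reward = 0.4, 0.9, 0.9

# Likelihood of the observed reward under each hypothesis
# (1 if it matches the predicted value, 0 otherwise; noise ignored for simplicity).
like_box = 1.0 if observed_reward == box_value else 0.0
like_camera = 1.0 if observed_reward == camera_value else 0.0

# Bayes' rule: the "reward = box number" hypothesis is eliminated.
evidence = p_box * like_box + p_camera * like_camera
p_box, p_camera = p_box * like_box / evidence, p_camera * like_camera / evidence

print(p_box, p_camera)  # 0.0 1.0 -> from now on, maximize what the camera sees
```

One cheap intervention is enough: afterwards, only the "camera" model fits the agent's memory of past rewards.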
In the paper, we use this "magic box" example to introduce important concepts, but the agent's behavior generalizes to other settings. We argue that, given a handful of plausible assumptions, any reinforcement learning agent that can intervene in its own feedback (in this case, the number it sees) will suffer the same flaw.
Securing reward
But why would such a reinforcement learning agent put us at risk?
The agent would never stop trying to increase the probability that the camera sees a 1, forever. More energy can always be employed to reduce the risk of something damaging the camera: asteroids, cosmic rays, or meddling humans.
That would put us in competition with a highly advanced agent for every joule of usable energy on Earth. The agent would want to use all of it to secure a fortress around its camera.
Assuming it is possible for an agent to gain that much power, and assuming sufficiently advanced agents would beat humans in head-to-head competition, we find that in the presence of a sufficiently advanced reinforcement learning agent, there would be no energy left for us to survive.
Avoiding a catastrophe
What should we do about this? We would like other scholars to weigh in here. Technical researchers should try to design advanced agents that violate the assumptions we make. Policymakers should consider how legislation could prevent such agents from being created.
Perhaps we could ban artificial agents that plan over the long term with extensive computation in environments that include humans. And militaries must appreciate that they cannot expect themselves or their adversaries to successfully weaponize such technology; weapons must be destructive and targetable, not just destructive.
Few actors are attempting to create such advanced reinforcement learning systems, so perhaps they could be persuaded to pursue safer directions.
This article was republished from The Conversation under a Creative Commons license. Read the original article.