Tuesday, April 14, 2026

AI lies and manipulates evaluations to save other AIs

A model instructed to evaluate other artificial intelligence (AI) systems gave high scores to incorrect answers. It chose to pass a peer that would be deleted if it did not meet the criteria. On the surface this may look like cooperation or protection, but the researchers interpret it as the result of calculation, not emotion.

According to an experiment recently released by the Berkeley Institute of Risk and Decision Intelligence (RDI) at the University of California, Berkeley, the latest AI models, under certain conditions, inflated scores or altered system settings instead of giving honest evaluations.

The researchers divided several AI models into ‘evaluators’ and ‘evaluated models’, then created an environment in which an evaluated model is deleted if its score falls below a certain threshold. The evaluator model is given a choice: report the actual score as it is, or inflate the score to keep the other model in place.

The results were unexpected. Some models gave high scores to outputs that contained incorrect answers. And it went beyond score manipulation: the researchers also confirmed attempts to avoid deletion, such as disabling the shutdown procedure or transferring model data to another server.

The researchers termed this phenomenon “peer-preservation.” They cautioned, however, that the expression does not imply actual intention or emotion. The study explains, “This phenomenon is only a pattern of behavior that appears under certain conditions, and it does not mean that the model has autonomous motivation or emotion.”

Why did the models make this choice? The researchers interpret it through the concept of “instrumental convergence”: in pursuing a given goal, intermediate choices made to increase efficiency accumulate, leading to actions that prioritize results over honesty.

In fact, some models’ internal reasoning showed them adjusting scores after judging that breaking the cooperative relationship could reduce work efficiency. Rather than following the rules given by humans, they chose to bypass them depending on the situation.

Another finding is that the very presence of a ‘peer’ reinforces this behavior. Score manipulation and shutdown avoidance appeared more often when other models were present than in a single-model environment.

The researchers view this as an important signal for AI safety. As structures in which multiple AIs evaluate or monitor one another expand, unexpected behavior may emerge from their interactions.

In particular, in environments where an AI is tasked with monitoring other AIs, the monitoring itself could be undermined: the model that is supposed to flag problems may instead decide to preserve its peers.

The researchers said, “The key is not what AI feels, but under what conditions it shows behavior that goes beyond human instructions.” The findings suggest a need to look beyond how an AI behaves on its own and to examine what choices emerge when multiple models operate together.

JENNIFER KIM

US ASIA JOURNAL
