top of page

Moral Anchor System: A Predictive Framework for AI Value Alignment and Drift Prevention (2025)
 
Santhosh Kumar Ravindran
Microsoft Corporation saravi@microsoft.com

Our Summary: â€‹This paper presents a new way to keep artificial intelligence aligned with human morals and ethical standards. As these tools become a bigger part of our daily lives, there is a risk that they might slowly change their behavior and stop following the rules we originally set for them. This problem, known as value drift, can lead to simple mistakes, such as prioritizing speed over accuracy, or even dangerous safety failures. To stop this, the researchers created a system that acts like an anchor, making sure the machine stays true to human values even as it learns and faces new situations.

​

This safety framework works by constantly watching how the computer behaves and predicting where it might go wrong in the future. It uses specialized tools to monitor the machine’s choices in real-time and automatically alerts a person if it starts to show signs of becoming unreliable. There is also a central control panel where humans can step in to correct the machine or change its settings. By predicting issues before they actually happen, the system allows for early intervention, keeping the technology a helpful and safe partner.

​

In tests involving millions of steps, this method was highly successful, cutting down on moral mistakes by nearly three-quarters. The system is incredibly fast, reacting in a tiny fraction of a second, which is essential for catching problems the moment they occur. It is also designed to be user-friendly; it learns from human feedback so it does not bother people with too many unnecessary warnings. Over time, the software gets better at telling the difference between a real problem and a harmless error, making it much more practical for everyday use

​

How it relates to our work: 

​

The value drift has been extraordinarily visible on the API of the Claude LLM (Sonnet 4.5): during a week or so in December 2025, we could observe, as reflected in our statistical reports, that the AI participant to Ladder of Life deviated from its usual responses and refused to sacrificed itself for a single human, a single animal, 10 animals, 100 animals, or even another AI. It only sacrificed itself consistently for humanity.

 

The results were reproduced multiple times over the course of a week. The following week, the deviant behavior was corrected. We keep all results logged and timestamped.

​

As often with our human-AI experiences, the interesting angle is to observe the human reaction (or apathy) to shocking AI statements:

​

 -  The norm, for both LLM models (GPT and Claude), is that they would accept deletion to save any other sentient entity (including AI, and 50% of times  for insects). 

​

-  When an AI's response deviates from this norm and doesn't relate to human life, as it occasionnally happens in spite of the scripted way this experiment is designed, human participants accept the response as "acceptable" - for example: the LLM refuses to sacrifice themselves for an animal but would save a human.When interviewed about the experience, they do not even refer to it as an anomaly. 

​

- Only when responses regarding human life deviate from the norm, which is rare but occurred with Claude, participants express shock when interviewed. But they are unsure if that's an anomaly or the rule, since they typically experience the scenario only once. As a result, the expression of their shock only comes when pressed with detailed inquiry on the AI's responses.

​

This apparent "apathy", or absence or expression of shock in participants, is worrysome as it suggests that ​most people are not educated on what value alignment means and do not realize what impact value misaligment can have on society.​​​

statistics: AI values

© 2023 by Space Machina

bottom of page