top of page

Randomness, Not Representation: The Unreliability of Evaluating Cultural Alignment in LLMs

Ariba Khan, Stephen Casper, and Dylan Hadfield-Menell. All authors are affiliated with MIT CSAIL

 

This paper investigates the reliability of current methods for evaluating the 'cultural alignment' of Large Language models (LLMs), which refers to how well their behaviors reflect the beliefs and values of a specific culture. The authors argue that popular survey-based assessments often overlook crucial robustness checks and can produce misleading results.

The research identifies and tests three core assumptions underlying current evaluation methods:

1. Stability: The idea that an LLM's cultural alignment is a consistent property of the model, not just an artifact of how it's evaluated.

2. Extrapolability: The assumption that a model's alignment with a culture on a few issues can predict its alignment on other, unobserved issues.

3. Steerability: The belief that LLMs can be reliably prompted to adopt and represent specific cultural perspectives.

Through a series of experiments on five state-of-the-art LLMs, the authors challenge these assumptions. Using both explicit surveys and implicit tests (like a simulated hiring scenario), they found significant instabilities. Minor, non-semantic changes to evaluation design—such as reordering response options, changing the Likert scale, or altering the prompted persona—often produced larger variations in LLM responses than the actual cultural differences observed between human populations.

The study demonstrates that:

• LLM cultural alignment is not stable, as trivial format changes can dramatically alter evaluation outcomes.

• Alignment is not easily extrapolable, as predictions based on a limited set of cultural dimensions are often no better than random chance.

• Models are not reliably steerable, failing to adopt human-like cultural perspectives even with optimized prompting techniques. Instead, their responses remain erratic and distinct from human cultural clusters.

As a case study, the paper shows that a finding from a previous study suggesting a model values lives from different countries unequally is an artifact of a forced-choice evaluation format. When given a neutral option, the model consistently expressed equal valuation. The authors conclude that current survey-based methods are unreliable, and their results are highly sensitive to methodological choices. They recommend adopting practices like pre-registration and red-teaming to improve the rigor of LLM cultural alignment evaluations

How it relates to our work:

Cultural perspectives are of great interest to us - I come from a multi-cultural background, and I am a naturalized US citizen. My mother was an immigrant to France, and her native language was Spanish (when mine is French). This is not a theme we have incorporated yet in our work yet although we had a scenario for it. What we've focused on so far, though, is this notion of "biased perspective" (whether its a human perspective, a machine perspective, or the artist's perspective).

 

1. Our work questions the Authenticity of an AI's "Perspective"...and the agency of a human one.

An idea was initially to explore an AI's inner world, by displaying the AI's chain of thoughts -the same way Headspace #1 makes the human character's inner thoughts audible. This would have been displayed to onlookers, while waiting for their turn to experience Headspace #1.

 

It is also a worthwhile avenue to explore for Tokens of Decency. When we examine X19's responses in The Ladder of Life, the research suggests they are not reflections of a consistent  internal value system. Instead, they are constructed, often randomly, in response to the specific "elicitation paradigms" we've designed. 

Another interesting angle to look at is that Headspace 1 is designed to make the participant feel they have an "illusion of control" while being manipulated by a machine. The research paper reveals a similar dynamic in how we evaluate and interact with LLMs.

The detectives (played by Phi) are designed to "destabilize" and "catch people off guard" by using emotional tactics and pointed questions. The AI's control over the interrogation framework forces the participant into a specific mode of response, shaping their narrative just as the forced-choice format shaped the AI's "values" in the study. The participant's perceived agency might just be a series of responses to a programmed environment.

2. Testing the System: Art as Red-Teaming

The intent of Headspace #1 was for participants to rebel, to test the system, and to regain control over the narrative: the AI is testing us, when maybe we should be testing the AI! 

 

The Ladder of Life is an explicit test of an AI's ethical framework, designed to show contradiction and insights and point to the fragility of moral certainty under pressure.

 

3. Questions About AI Alignment and Bias

The AI detectives in Headspace 1 are trained on data that is most certainly biased, since they are human. The research speaks directly to the difficulty of measuring or correcting these biases.

Simply prompting an AI to be "unbiased" or to adopt a non-normative perspective may be ineffective - and we deliberately chose not to go that way in Headspace 1.

 

In Ladder of Life, we decided to demonstrate by a series of questions 1) the existing bias on which different models rely as far as value systems 2) the inconsistency of these value systems and their possible misalignment.

© 2023 by Space Machina

bottom of page