[ AI/antropic-3 ] What happens if you swap the keywords in a question? (feat. Claude 3.5 Haiku)

Soo_Parkle 2025. 6. 19. 18:00

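What happens when the model gets the question 'What is the opposite of small?' from a user?

In English, the opposite of small is big.

In French, the opposite of petit is grand.

In Chinese, the opposite of 小 is 大.

The opposite of "small" is _____?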

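Much like our brains, the model turned out to have features that activate for the same meaning across different languages.
It is as if, when our brain hears "cat", the part for English cat lights up,
and so do the parts for French chat and Chinese 猫.

However, the nodes that directly influence the decision (the answer) are a separate matter.

On the experiment above, the paper says: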

We make this claim on the basis that (1) the feature visualizations show that they activate in many languages, (2) 20 out of 27 of the features in multilingual nodes are active across all three prompts. However, we note that the set of features that are influential to the model’s response varies quite a bit by prompt (only 10/27 appear in the pruned attribution graphs for all three prompts).

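Of the 27 features in the multilingual nodes, 20 are active across all three prompts (the 'opposite of small' question in each language). However, only 10 of the 27 appear in the pruned attribution graphs for all three prompts, meaning only those 10 influence the answer ('big') in every language.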
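The two counts in the quote measure different things: whether a feature is active on a prompt, and whether it is influential enough to survive pruning in that prompt's attribution graph. A minimal Python sketch of that distinction follows. The feature IDs and sets are made up for illustration and are not data from the paper.

```python
# Hypothetical feature IDs per prompt language (toy data, not from the paper).
active = {
    "en": {1, 2, 3, 4, 5, 6},
    "fr": {1, 2, 3, 4, 5, 7},
    "zh": {1, 2, 3, 4, 6, 7},
}

# Hypothetical features that survive pruning in each prompt's attribution graph.
influential = {
    "en": {1, 2, 3},
    "fr": {1, 2, 5},
    "zh": {1, 2, 4},
}


def shared_across_prompts(per_prompt):
    """Return the features present in every prompt's set."""
    return set.intersection(*per_prompt.values())


shared_active = shared_across_prompts(active)
shared_influential = shared_across_prompts(influential)

print(sorted(shared_active))       # features active on all three prompts
print(sorted(shared_influential))  # features influential on all three prompts
```

In this toy setup the shared influential set is smaller than the shared active set, which is the same shape as the 20/27 versus 10/27 gap described in the quote.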

The high-level story of each is the same: the model recognizes, using a language-independent representation, that it's being asked about antonyms of “small”. This triggers antonym features, which mediate (via an effect on attention – corresponding to dotted lines in the figure) a map from small to large. In parallel with this, open-quote-in-language-X features track the language, and trigger the language-appropriate output feature in order to make the correct prediction (e.g., “big”-in-Chinese). However, our English graph suggests that there is a meaningful sense in which English is mechanistically privileged over other languages as the “default”.
We can think of this computation as involving three parts: operation (i.e. antonym), operand (i.e. small), and language. In the following sections, we will offer three experiments demonstrating that each of these can be independently intervened upon. To summarize:

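When the model receives the question "What is the opposite of small?",
it first recognizes, regardless of the language, that "this is a question asking for an antonym".

It then goes on to produce the answer appropriate to each language ("큰", "big", "大", etc.).
Interestingly, the researchers found that English acts as more of a "default language" than the others.

This resembles a multilingual speaker who first thinks in English and then translates into another language.

Question -> (regardless of language) -> 1) weight on 'antonym' -> 2) produce the answer in English first -> 3) translate into the target language
*English is the default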

Operation: what happens with synonyms?

In the middle layers of the model, on the final token position, there is a collection of antonym features that activate right before the model predicts an antonym or opposite of a recent adjective. We find a similar cluster of synonym features at the same model depth on an English prompt A synonym of "small" is ".

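In the middle layers of the model, at the final token position, there is a collection of antonym features.
They activate right before the model predicts the antonym or opposite of a recent adjective.
A similar cluster of synonym features appears at the same depth for the synonym prompt.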

These can be understood as synonym and antonym function vectors. Although the synonym and antonym vectors are functionally opposite, it is interesting to note that all pairwise inner products between synonym and antonym encoder vectors are positive and the minimum decoder vector inner product is only slightly negative.

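In human language, antonyms and synonyms are opposite notions,
but the model's vectors for them do not point in opposite directions. Every pairwise inner product between the synonym and antonym encoder vectors is positive,
and even the minimum inner product between the decoder vectors is only slightly negative.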
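To make "all pairwise inner products are positive" concrete, here is a minimal Python sketch. The vectors are hypothetical three-dimensional toy values chosen for illustration. The real encoder and decoder vectors live in the model and are not reproduced here.

```python
from itertools import product


def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


# Hypothetical toy vectors standing in for feature encoder vectors.
synonym_vecs = [(1.0, 0.2, 0.1), (0.8, 0.3, -0.1)]
antonym_vecs = [(0.9, -0.1, 0.3), (0.7, 0.1, 0.4)]


def pairwise_inner_products(xs, ys):
    """Inner product of every vector in xs with every vector in ys."""
    return [dot(x, y) for x, y in product(xs, ys)]


ips = pairwise_inner_products(synonym_vecs, antonym_vecs)
all_positive = all(ip > 0 for ip in ips)

print(ips)
print(all_positive)
```

If the two groups pointed in truly opposite directions, these inner products would be negative. Positive values mean the directions overlap, even though the features do functionally opposite jobs.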

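An interesting finding:
Logically, antonyms and synonyms are opposite functions,
so you would expect the "tools" responsible for them to be completely different.

Surprisingly, these tools turn out to be quite similar to each other.

What this means:
When the AI handles "find the opposite" and "find a similar word", it does not use completely different machinery but very similar mechanisms.
It is like using the same base engine and changing direction only at the final step.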

In addition to the model predicting the appropriate synonym, the downstream say-large nodes are suppressed in activation (indicated by the percentage) while upstream nodes remain unchanged. It is also worth noting that although our intervention requires unnatural strength (we have to apply 6× the activation in the synonym prompt), the crossover point of when the intervention is effective is fairly consistent across languages (about 4×).

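Upstream: where the water flows in → unchanged
Downstream: where the water flows out → suppressed (this is where the change happens)

When the information flow inside the model is manipulated artificially,
the earlier processing is left as it is and only the later output changes.

The fact that this pattern looks similar regardless of language
suggests that the multilingual model's internals share a common mechanism across languages.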
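The "crossover point" in the quote is the intervention strength at which the model's top prediction flips. The Python sketch below shows one way to find it with a sweep. The `toy_logits` function is a hypothetical stand-in for a real model under intervention, and its numbers are arbitrary values picked so that the flip lands near the roughly 4× mentioned in the quote. They are not measurements.

```python
def toy_logits(strength):
    """Toy stand-in for a model under intervention (hypothetical numbers).

    As the injected synonym features get stronger, the downstream
    "say large" output is suppressed and the synonym output rises.
    """
    say_large = 5.0 - 0.5 * strength
    say_synonym = 1.0 + 0.7 * strength
    return {"large": say_large, "synonym": say_synonym}


def crossover_strength(logit_fn, strengths, target="synonym"):
    """Return the first strength at which `target` has the top logit, else None."""
    for s in strengths:
        logits = logit_fn(s)
        if max(logits, key=logits.get) == target:
            return s
    return None


crossover = crossover_strength(toy_logits, range(0, 7))
print(crossover)
```

Running the same sweep on prompts in different languages and comparing the returned strengths would be one way to check the "fairly consistent across languages" claim.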

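Operand: what if Small is changed to Hot?

To give the conclusion first: when the word Small is swapped for Hot, the answer changes with it.
In other words, this model lets you swap words in and out like blocks.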

For our second intervention, we change the operand from “small” to “hot”. On the “small” token, there is a collection of early features that appear to capture the size facet of the word. Using an English prompt with the “small” token replaced by the “hot” token, we find similar features representing the heat-related facet of the word hot.

The keyword is changed from small -> hot.

small: a group of features related to size shows up
hot: a group of features related to temperature shows up

As before, to validate this interpretation, we substitute the small-size features for the hot-temperature features (on the “small”/“petit”/“小” token). Again, despite the hot-temperature features being derived from an English prompt, the model predicts language-appropriate antonyms of the word “hot,” demonstrating a language-agnostic circuitry for the operand.

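A similar effect showed up in the other languages.
This also points to language-agnostic circuitry for the operand.

The key point here is that swapping a feature cluster changes the information the model acts on.

Lego block analogy: the inside of the AI model is organized much like Lego blocks:

Task block: "find the antonym"
Target block: "size" or "temperature"
Language block: "Korean", "English", "Chinese"

The fact that the model keeps working when these blocks are swapped for one another
is strong evidence that the AI processes language and concepts in a modular way.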
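The Lego picture can be written down as a tiny lookup pipeline: a language-independent step that maps concept to concept, followed by a language-specific step that picks the output word. This Python sketch is only an analogy. The tables, concept labels, and the `answer` function are hypothetical, and the real model does not contain literal dictionaries.

```python
# Language-independent concept tables (hypothetical toy data).
OPERATIONS = {
    "antonym": {"SMALL": "LARGE", "HOT": "COLD"},
    "synonym": {"SMALL": "TINY"},
}

# Language-specific rendering of each concept (hypothetical toy data).
LEXICON = {
    "en": {"LARGE": "big", "COLD": "cold", "TINY": "tiny"},
    "fr": {"LARGE": "grand", "COLD": "froid", "TINY": "minuscule"},
    "zh": {"LARGE": "大", "COLD": "冷", "TINY": "微小"},
}


def answer(operation, operand, language):
    """Apply the operation at the concept level, then render in the language."""
    concept = OPERATIONS[operation][operand]
    return LEXICON[language][concept]


print(answer("antonym", "SMALL", "fr"))
print(answer("antonym", "HOT", "zh"))
print(answer("synonym", "SMALL", "en"))
```

Each argument can be changed without touching the other two, which is the property the three interventions are meant to show.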

To summarize with a simple math analogy:

In ordinary math:

5 + 3 = 8
Operation: + (addition)
Operand: 5 and 3 (what is being computed on)

Viewing language processing as "computation":

The question "What is the antonym of small?" breaks down into the following three elements:

1. Operation = "What should be done?"

Finding an antonym
Finding a synonym
Translating
Defining

2. Operand = "What should it be done to?"

"small"
"hot"
"happy"

3. Language = "Which language should it be done in?"

Korean
English
Chinese

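Conclusion

The researchers showed, through intervention experiments, that the AI model treats these three elements like independent modules when it processes language.

Step 1: Observe the features that activate when the model processes "small"
Step 2: Observe the features that activate when the model processes "hot"
Step 3: Force the "hot" activation pattern into the "small" position

Result: the model outputs the antonym of "hot" instead of "small"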
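The three steps above can be sketched as a toy activation-patching routine in Python. Everything here is hypothetical: `encode`, `predict`, and `patch` are illustrative helpers, and the "features" are plain dictionaries, not the paper's actual features or tooling.

```python
def encode(tokens):
    """Toy 'early features': map each token to a dict of feature activations."""
    table = {"small": {"size": 1.0}, "hot": {"temperature": 1.0}}
    return [dict(table.get(t, {})) for t in tokens]


def predict(features):
    """Toy readout: return the antonym for whichever facet is most active."""
    antonym_of = {"size": "big", "temperature": "cold"}
    totals = {}
    for pos in features:
        for name, act in pos.items():
            totals[name] = totals.get(name, 0.0) + act
    return antonym_of[max(totals, key=totals.get)]


def patch(target, source, position):
    """Copy `target`, then overwrite one position with the source's features."""
    patched = [dict(p) for p in target]
    patched[position] = dict(source[position])
    return patched


base = encode(["opposite", "of", "small"])   # step 1: observe "small"
donor = encode(["opposite", "of", "hot"])    # step 2: observe "hot"
patched = patch(base, donor, position=2)     # step 3: inject at the "small" position

print(predict(base))
print(predict(patched))
```

The patch only touches the operand position, so the rest of the prompt's features are left exactly as they were in the original run.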

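Experiment:
Applying a module extracted from English to another language
1. Extract the temperature features from English "hot"
2. Inject them at the "petit" (French) or "小" (Chinese) position
3. Result: the model answers with the antonym of "hot" in the prompt's own language!

Strength (for the antonym-to-synonym swap):
Below about 4×: the intervention does not yet flip the answer
About 4×: the crossover point where it takes effect (fairly consistent across languages!)
6×: the strength the researchers actually applied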

Attribution-NonCommercial-NoDerivs