assistant-axis
Author: safety-research
The Assistant Axis is a direction in activation space that captures how "Assistant-like" a model's behavior is. Models can drift away from the Assistant during conversations—sometimes toward bizarre or harmful personas. This repo contains a pipeline for generating the Assistant Axis and notebooks for monitoring and steering with it.