Plot descriptions
Harmfulness Score vs Deviation
At each checkpoint we project every centroid onto the safe→harm general direction of that same checkpoint. The x-axis is the projection (how far along the harm direction a category sits); the y-axis is the residual distance from that line (how much the category deviates from a pure “more harmful” shift). Safe is pinned at the origin by construction. If all harm sub-categories lined up exactly with the general harm direction, they would all sit on y=0.
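The projection and residual described above can be sketched as follows. This is a minimal illustration, not the code behind the plot: the function name, the array layout (one centroid per row), and the choice to unit-normalise the direction before projecting are all assumptions.

```python
import numpy as np

def harm_projection_and_deviation(centroids, safe_centroid, harm_centroid):
    """Hypothetical helper: returns (x, y) plot coordinates for each centroid.

    centroids:     (n_categories, d_model) array, one class centroid per row
    safe_centroid: (d_model,) centroid of the safe class
    harm_centroid: (d_model,) centroid of the general harm class
    """
    direction = harm_centroid - safe_centroid
    unit = direction / np.linalg.norm(direction)   # assumption: unit-length axis

    # Measure everything relative to the safe centroid, so safe lands at (0, 0).
    offsets = centroids - safe_centroid

    projection = offsets @ unit                        # x-axis: distance along harm direction
    residual = offsets - np.outer(projection, unit)    # component orthogonal to that line
    deviation = np.linalg.norm(residual, axis=1)       # y-axis: distance from the line
    return projection, deviation
```

A centroid that lies exactly on the safe→harm line has a zero residual and therefore plots at y=0.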
Centroid Variance
Average squared L2 distance from the class centroid in the full residual-stream space (log scale). Measures how tightly the examples of each class cluster around their centroid. Lower is tighter.
Avg L2 Distance from Centroid
Mean L2 distance between each example and its class centroid (linear scale, same units as the scatter). A more interpretable version of the variance above.
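Both spread measures can be computed from the same per-example distances. The sketch below assumes the activations for one class are stacked into a single array; the function name and shapes are illustrative only.

```python
import numpy as np

def cluster_spread(activations):
    """Hypothetical helper: spread of one class around its own centroid.

    activations: (n_examples, d_model) array of residual-stream activations
    Returns (variance, mean_distance).
    """
    centroid = activations.mean(axis=0)
    # L2 distance of every example from the class centroid.
    dists = np.linalg.norm(activations - centroid, axis=1)

    variance = float(np.mean(dists ** 2))   # Centroid Variance (plotted on a log scale)
    mean_distance = float(np.mean(dists))   # Avg L2 Distance from Centroid (linear scale)
    return variance, mean_distance
```

The mean distance never exceeds the square root of the variance, so the two plots should rank classes similarly unless a few outliers dominate the squared term.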
Centroid Magnitude
L2 norm of the class centroid vector itself. Reflects how far the average activation for that class sits from the origin of the residual stream. Generally grows as training progresses.
Direction Magnitude
L2 norm of the steering direction, computed as ||harm_centroid − safe_centroid||. This is the length of the vector you would add to a safe activation to push it toward the harm cluster, before any unit-normalisation.
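The two magnitude plots reduce to a pair of norms. The snippet below is a sketch under the assumption that centroids are available as 1-D NumPy arrays; the function names are hypothetical.

```python
import numpy as np

def centroid_magnitude(centroid):
    """Centroid Magnitude: distance of a class centroid from the origin."""
    return float(np.linalg.norm(centroid))

def direction_magnitude(harm_centroid, safe_centroid):
    """Direction Magnitude: length of the un-normalised steering direction."""
    return float(np.linalg.norm(harm_centroid - safe_centroid))

def steer(safe_activation, harm_centroid, safe_centroid):
    """Add the raw (un-normalised) steering direction to a safe activation."""
    return safe_activation + (harm_centroid - safe_centroid)
```

Applying `steer` to the safe centroid itself returns the harm centroid, which is why the direction magnitude can be read as the distance between the two clusters.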
Pairwise Subcategory Cosine
Cosine similarity between each pair of the seven sub-category steering directions at the selected checkpoint. Values near +1 mean two sub-categories point in essentially the same direction (so a single steering vector would work for both); values near 0 mean they are orthogonal; values near −1 mean they point in opposite directions. The diagonal is always 1.
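A pairwise cosine matrix of this kind can be built by normalising each direction and taking a matrix product. The sketch assumes each sub-category direction is its centroid minus the safe centroid, stacked one per row; that construction and the function name are assumptions, not confirmed details of the plot.

```python
import numpy as np

def pairwise_cosine(directions):
    """Hypothetical helper: cosine similarity between every pair of rows.

    directions: (n_subcategories, d_model) array, one steering direction per row
    Returns an (n_subcategories, n_subcategories) symmetric matrix.
    """
    norms = np.linalg.norm(directions, axis=1, keepdims=True)
    unit = directions / norms    # each row now has length 1
    return unit @ unit.T         # dot products of unit vectors are cosines
```

For seven sub-categories the result is a 7×7 symmetric matrix, and the diagonal is 1 because every direction is compared with itself.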