Centroid Dynamics Across Training

[Interactive figure: panels for Harmfulness Score vs Deviation, Centroid Variance, Avg L2 Distance from Centroid, Centroid Magnitude, Direction Magnitude, and Pairwise Subcategory Cosine (colour scale from +1 to −1), with zoom and checkpoint controls.]

Plot descriptions

Harmfulness Score vs Deviation

At each checkpoint we project every centroid onto the safe→harm general direction of that same checkpoint. The x-axis is the projection (how far along the harm direction a category sits); the y-axis is the residual distance from that line (how much the category deviates from a pure “more harmful” shift). Safe is pinned at the origin by construction. If all harm sub-categories lined up exactly with the general harm direction, they would all sit on y=0.
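The projection and residual described above can be sketched with NumPy as follows. This is an illustrative sketch, not the code behind the plot: the function name and toy vectors are hypothetical, and it assumes the general direction is unit-normalised before projecting and that the general harm centroid is already available as a single vector.

```python
import numpy as np

def score_and_deviation(centroids, safe_centroid, harm_centroid):
    """Return (x, y) plot coordinates for each centroid at one checkpoint.

    Assumption: the safe->harm direction is unit-normalised, so x is in
    activation units. The source does not say whether this is the case.
    """
    direction = harm_centroid - safe_centroid
    unit = direction / np.linalg.norm(direction)
    offsets = centroids - safe_centroid              # safe lands on the origin
    projection = offsets @ unit                      # x-axis: distance along the harm direction
    off_axis = offsets - np.outer(projection, unit)  # component orthogonal to the direction
    deviation = np.linalg.norm(off_axis, axis=1)     # y-axis: distance from the line
    return projection, deviation

# Toy 3-d example: safe, general harm, and one sub-category centroid.
safe = np.array([0.0, 0.0, 0.0])
harm = np.array([2.0, 0.0, 0.0])
centroids = np.array([safe, harm, [1.0, 3.0, 4.0]])
x, y = score_and_deviation(centroids, safe, harm)
```

In the toy data the third centroid sits halfway along the harm direction (x = 1) but 5 units off the line, so it would appear well above y = 0.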

Centroid Variance

Average squared L2 distance from the class centroid in the full residual-stream space (log scale). Measures how tightly the examples of each class cluster around their centroid. Lower is tighter.

Avg L2 Distance from Centroid

Mean L2 distance between each example and its class centroid (linear scale, same units as the scatter). Easier to interpret than the variance above because it is in activation units rather than squared units.
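The two spread measures can be computed side by side, as in this small sketch. The function name and toy data are hypothetical; the only assumption is that a class's activations are stacked as rows of one array.

```python
import numpy as np

def spread_stats(activations):
    """Return (mean squared L2 distance, mean L2 distance) from the class centroid."""
    centroid = activations.mean(axis=0)
    dists = np.linalg.norm(activations - centroid, axis=1)
    return float(np.mean(dists ** 2)), float(np.mean(dists))

# Toy class with four 2-d activations; the centroid is [3, 0].
acts = np.array([[0.0, 0.0], [2.0, 0.0], [0.0, 0.0], [10.0, 0.0]])
variance, mean_dist = spread_stats(acts)
```

The mean distance is never larger than the square root of the variance, and the two differ most when a few examples sit far from the centroid, as the outlier in the toy data shows.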

Centroid Magnitude

L2 norm of the class centroid vector itself. Reflects how far the average activation for that class sits from the origin of the residual stream. It generally grows as training progresses.

Direction Magnitude

L2 norm of the steering direction, computed as ||harm_centroid − safe_centroid||. This is the length of the vector you would add to a safe activation to push it toward the harm cluster, before any unit-normalisation.
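As a rough illustration of the two magnitudes, the sketch below computes the centroid norm for each class and the norm of the un-normalised steering direction. The function names and toy activations are hypothetical.

```python
import numpy as np

def centroid_magnitude(activations):
    """L2 norm of the class centroid (mean activation)."""
    return float(np.linalg.norm(activations.mean(axis=0)))

def steering_direction(harm_acts, safe_acts):
    """Un-normalised direction: harm_centroid - safe_centroid."""
    return harm_acts.mean(axis=0) - safe_acts.mean(axis=0)

safe_acts = np.array([[1.0, 0.0], [1.0, 2.0]])   # centroid [1, 1]
harm_acts = np.array([[4.0, 4.0], [4.0, 6.0]])   # centroid [4, 5]

direction = steering_direction(harm_acts, safe_acts)
direction_magnitude = float(np.linalg.norm(direction))  # length of [3, 4]
safe_magnitude = centroid_magnitude(safe_acts)
steered = safe_acts.mean(axis=0) + direction             # lands on the harm centroid
```

Adding the full direction to the safe centroid reproduces the harm centroid exactly, which is why its norm reads as the distance between the two clusters.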

Pairwise Subcategory Cosine

Cosine similarity between the seven sub-category steering directions at the selected checkpoint. Values near +1 mean two sub-categories point in essentially the same direction (so a single steering vector would work for both); values near 0 mean they are orthogonal; values near −1 mean they point in opposite directions. The diagonal is always 1, because each direction is compared with itself.
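A cosine matrix of this kind can be built in a few lines. This sketch assumes each sub-category direction is its centroid minus the safe centroid, mirroring the general direction defined earlier; the source does not spell that out. The function name is hypothetical, and four toy directions stand in for the seven sub-categories.

```python
import numpy as np

def pairwise_cosine(sub_centroids, safe_centroid):
    """Cosine similarity between every pair of sub-category directions."""
    directions = sub_centroids - safe_centroid   # assumed definition of each direction
    unit = directions / np.linalg.norm(directions, axis=1, keepdims=True)
    return unit @ unit.T

safe = np.zeros(2)
sub_centroids = np.array([
    [1.0, 0.0],    # reference direction
    [2.0, 0.0],    # same direction, larger magnitude
    [0.0, 3.0],    # orthogonal
    [-1.0, 0.0],   # opposite
])
cos = pairwise_cosine(sub_centroids, safe)
```

Because each direction is unit-normalised first, the matrix ignores magnitude: the first two toy directions differ in length yet have cosine 1.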