The risks of developing superhuman AI capabilities without understanding them are unacceptable. So let’s have a look at the toolbox we have at hand today. This post gets fairly technical; if you find parts difficult to grasp, feed the entire article to an LLM and ask it to explain in a way that makes sense to you.

This practice, or field of science, is called mechanistic interpretability. I personally got interested in it after reading Neel Nanda’s paper “Progress measures for grokking via mechanistic interpretability”. In that research, he found that a small transformer trained on modular addition had learned to solve it using discrete Fourier transforms and trigonometric identities. While it sounds fancy, I had no idea what it meant. My only takeaway was that the AI was clearly not doing what we expected.

So, to perform a small test of my own, I taught a relatively simple Nemotron model to monitor network traffic and identify anomalies based on a handbook that I gave it. The answers were spot on, and many would have been happy with that. Looking at the chain-of-thought, however, revealed that my “training” was not really teaching the model to act as I wanted; it rather gave the model an idea of what the trainer (me) wanted as output. The model found a different way to get the answers I wanted. So all I was able to communicate was my preferred output, not the way to achieve it. Scary? Yes indeed.

The Investigation

So let’s walk through what I discovered when I put that trained Nemotron model under the microscope and what other techniques we have in our toolkit today. The model had been trained to monitor network traffic and identify attack patterns, anomalies, and exploitation of disclosed vulnerabilities – exactly the kind of work done in Security Operations Centers (SOCs).

During testing, I noticed a spike in false positive “C2-beacon” alerts for traffic coming from a new partner network. False positive C2 (Command and Control) beacon alerts occur when security tools mistakenly flag legitimate network traffic as malicious beaconing activity. Benign applications can produce beaconing-like behavior, such as regularly checking for updates, which can be misidentified by signature-based or behavior-based detection systems. This results in alert fatigue and can cause security teams to ignore real threats.

Let’s walk through what we can find out, starting from the cheapest and fastest way of testing and progressing toward more resource-intensive and complex techniques.

The Interpretability Toolkit

Quick Attribution Methods (Laptop-Level Compute)

Attention Pattern Viewing

This technique visualizes where each attention head “looks” in the input – effectively a heatmap of dependencies the model references when forming its prediction. It’s descriptive and useful for triage, but correlation ≠ causation. An example would be BERTViz.

Question: When the model alerts “C2-beacon” on these partner flows, what tokens does it stare at?

Answer: Several heads fixate on JA3 fingerprints (unique signatures of how software establishes encrypted connections) and ASN/geo tokens (identifiers for which organization or country owns the network), barely glancing at inter-arrival timing or burst patterns, a first hint of an identity shortcut rather than behavioral evidence.

The “identity shortcut hypothesis” suggests that instead of analyzing what the network traffic is actually doing (behavioral patterns like timing, frequency, payload characteristics), the AI is taking a shortcut by focusing on who or where it’s coming from (identity markers like specific network signatures or geographic regions). This is problematic because legitimate traffic from new partners can look suspicious based on identity alone, while actual malicious behavior might be missed if it comes from trusted sources.
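To make this concrete, here is a minimal sketch of pulling attention patterns out of a Hugging Face model and plotting one head as a heatmap. I use bert-base-uncased purely as a runnable stand-in for the fine-tuned model, and the flow string, layer, and head indices are placeholders; BERTViz gives you the same view interactively.

```python
# Minimal sketch: inspect one attention head's pattern for a single example.
# bert-base-uncased, the layer/head choice, and the flow string are placeholders.
import torch
import matplotlib.pyplot as plt
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased", output_attentions=True)

flow = "ja3=1234abcd asn=AS65001 geo=DE interval=60s bytes=312"
inputs = tokenizer(flow, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

# outputs.attentions: tuple of (batch, heads, seq, seq) tensors, one per layer
layer, head = 8, 3          # hypothetical head that fixates on identity tokens
attn = outputs.attentions[layer][0, head]
tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])

plt.imshow(attn.cpu().numpy(), cmap="viridis")
plt.xticks(range(len(tokens)), tokens, rotation=90)
plt.yticks(range(len(tokens)), tokens)
plt.title(f"Layer {layer}, head {head} attention")
plt.tight_layout()
plt.show()
```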

Saliency / Gradient Methods

This technique uses gradients (e.g., Integrated Gradients, gradient×input, Grad-CAM style variants) to estimate which input features the prediction is most sensitive to; it’s fast triage but not strictly causal. An example would be Captum (Integrated Gradients).

Question: Which raw input features most sway these false positive (FP) decisions?

Answer: High sensitivity to partner network identifiers and connection signatures, with low sensitivity to timing patterns – confirming the AI is making decisions based on “who” rather than “what the traffic actually does.”
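As a sketch of what this looks like with Captum, assume the flows have been encoded into a fixed-length feature vector and fed to a small classifier head (both are hypothetical stand-ins here); Integrated Gradients then scores each input feature’s contribution to the C2 logit:

```python
# Sketch: Integrated Gradients over a tabular flow-feature classifier.
# FlowClassifier and the feature layout are hypothetical stand-ins.
import torch
from captum.attr import IntegratedGradients

class FlowClassifier(torch.nn.Module):
    def __init__(self, n_features=16, n_classes=2):
        super().__init__()
        self.net = torch.nn.Sequential(
            torch.nn.Linear(n_features, 64), torch.nn.ReLU(),
            torch.nn.Linear(64, n_classes),
        )
    def forward(self, x):
        return self.net(x)

model = FlowClassifier().eval()
ig = IntegratedGradients(model)

# One flow encoded as features: identity features first, timing features last (by assumption).
x = torch.randn(1, 16)
baseline = torch.zeros_like(x)

# Attribution toward the "C2" class (index 1 here, by assumption).
attributions, delta = ig.attribute(
    x, baselines=baseline, target=1, return_convergence_delta=True
)
print(attributions)   # large values on identity features would support the shortcut hypothesis
```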

Linear Probes

This technique trains tiny classifiers on frozen layer activations to test where a concept is linearly present in the representation (presence does not imply use). An example would be logistic regression probes over saved activations.

Question: Where are ASN/JA3 identity and “beacon periodicity” encoded across layers?

Answer: ASN/JA3 are decodable in early and mid layers; periodicity becomes decodable in mid→late layers—so the model can represent the behavioral evidence it’s neglecting. The AI actually has the information it needs to make correct decisions, but it’s choosing to ignore it.
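A probe is nothing more exotic than a logistic regression fitted per layer on activations you have already saved to disk. The activation arrays and concept labels below are placeholders:

```python
# Sketch: logistic-regression probes over saved per-layer activations.
# acts[layer] stands in for an (n_examples, d_model) dump; the labels are hypothetical.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

n, d_model, n_layers = 2000, 768, 24
acts = {l: np.random.randn(n, d_model) for l in range(n_layers)}  # placeholder dumps
asn_labels = np.random.randint(0, 2, n)           # "is this Partner X's ASN?"
periodicity_labels = np.random.randint(0, 2, n)   # "is the flow ~60s periodic?"

for layer in range(n_layers):
    for name, y in [("ASN identity", asn_labels), ("periodicity", periodicity_labels)]:
        probe = LogisticRegression(max_iter=1000)
        acc = cross_val_score(probe, acts[layer], y, cv=5).mean()
        print(f"layer {layer:2d}  {name:13s}  probe accuracy {acc:.2f}")
```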

Concept Vectors (TCAV)

This technique builds a direction in activation space from small concept example sets and measures how moving along that direction changes the class score. An example would be TCAV-style concept sensitivity on transformer activations.

Question: Is “C2” more sensitive to identity signals or to periodicity on the FP cohort?

Answer: Strong positive sensitivity from identity→C2 and weak sensitivity from periodicity→C2—showing the model relies too heavily on where traffic comes from rather than how it behaves.
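Here is a rough sketch of the TCAV recipe, assuming you have activation dumps for a concept set (Partner X flows) and a random set: fit a linear classifier between them, take its normal as the concept direction, and measure how often the C2 logit increases along that direction. The readout head and all data arrays are placeholders:

```python
# Sketch of a TCAV-style test: does moving activations along the "identity"
# direction raise the C2 logit? All data here is placeholder.
import numpy as np
import torch
from sklearn.linear_model import LogisticRegression

d_model = 768

# Activations at a chosen layer for concept examples vs. random examples.
concept_acts = np.random.randn(200, d_model)   # flows from Partner X (identity concept)
random_acts = np.random.randn(200, d_model)    # unrelated flows

# 1. Fit the concept activation vector (CAV): normal of the separating hyperplane.
clf = LogisticRegression(max_iter=1000).fit(
    np.vstack([concept_acts, random_acts]),
    np.concatenate([np.ones(200), np.zeros(200)]),
)
cav = torch.tensor(clf.coef_[0], dtype=torch.float32)
cav = cav / cav.norm()

# 2. Head mapping that layer's activations to class logits (placeholder weights).
readout = torch.nn.Linear(d_model, 2)

# 3. TCAV score: fraction of test flows whose C2 logit increases along the CAV.
test_acts = torch.randn(500, d_model, requires_grad=True)
readout(test_acts)[:, 1].sum().backward()
directional_derivs = test_acts.grad @ cav
tcav_score = (directional_derivs > 0).float().mean()
print(f"TCAV score (identity -> C2): {tcav_score:.2f}")
```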

Mid-Level Analysis (Single GPU Compute)

Logit Lens / Tuned Lens

This technique decodes intermediate hidden states into provisional class probabilities so we can see how the model’s belief evolves through the layers; Tuned Lens adds a learned calibration so the layer-by-layer view is stable.

Question: At what depth does the model really commit to “C2,” and is that commitment gated by identity tokens (JA3/ASN)?

Answer: The AI starts leaning toward “malicious beacon” when it sees identity markers; later processing stages confirm this decision only when those same identity markers are present – showing the decision is made based on “who” the traffic comes from, not behavioral analysis.
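A bare-bones version of the idea, assuming a hypothetical model that exposes per-layer hidden states and a two-class readout head, looks like this:

```python
# Sketch of a logit-lens pass: project each layer's hidden state through the
# final readout and watch the "C2 vs benign" gap develop layer by layer.
# The hidden states, normalization, and readout below are placeholders.
import torch

def logit_lens(hidden_states, final_norm, readout):
    """hidden_states: list of (seq, d_model) tensors, one per layer."""
    for layer, h in enumerate(hidden_states):
        # Apply the model's own final normalization before the readout,
        # otherwise early-layer logits are badly mis-scaled.
        logits = readout(final_norm(h[-1]))          # last position only
        gap = (logits[1] - logits[0]).item()          # C2 logit minus benign logit
        print(f"layer {layer:2d}  C2-benign gap: {gap:+.2f}")

# Placeholder components standing in for the real model's pieces.
d_model, n_layers = 768, 24
hidden_states = [torch.randn(32, d_model) for _ in range(n_layers)]
final_norm = torch.nn.LayerNorm(d_model)
readout = torch.nn.Linear(d_model, 2)

logit_lens(hidden_states, final_norm, readout)
```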

Direct Logit Attribution (DLA)

This technique decomposes the final logit gap (e.g., C2 − benign) into contributions from individual blocks/heads using the readout map, telling us which components actually pushed the decision over the threshold. An example would be DLA implementations in TransformerLens-style tooling.

Question: Which blocks/heads are responsible for most of the C2 score on these false positives?

Answer: Specific AI components that focus on identity information are responsible for most of the “malicious beacon” score – identifying exactly which parts of the AI are causing the false alarms.
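The mechanics are simple once you have per-component outputs cached: because the residual stream is (ignoring the final layer norm for simplicity) a sum of component outputs, each component’s contribution to the C2 minus benign logit gap is just a dot product with the readout difference direction. Everything below is a placeholder for the real cached values:

```python
# Sketch of direct logit attribution over cached component outputs.
import torch

d_model = 768
names = [f"block{i}.attn" for i in range(24)] + [f"block{i}.mlp" for i in range(24)]

# Each component's contribution to the residual stream at the decision position (placeholder).
component_outputs = {name: torch.randn(d_model) for name in names}

W_readout = torch.randn(2, d_model)            # rows: [benign, C2] (placeholder)
logit_diff_direction = W_readout[1] - W_readout[0]

contributions = {name: torch.dot(out, logit_diff_direction).item()
                 for name, out in component_outputs.items()}

for name, c in sorted(contributions.items(), key=lambda kv: -abs(kv[1]))[:5]:
    print(f"{name:12s}  pushes C2-benign gap by {c:+.2f}")
```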

Concept Erasure (INLP/LEACE)

This technique identifies and removes the linear subspace encoding a concept (e.g., geo/ASN) and observes the behavioral impact to test dependence on that concept. An example would be INLP/LEACE subspace removal.

Question: If we erase ASN/geo information, do false positives collapse while true beacons remain?

Answer: False alarms drop sharply while real malicious traffic detections mostly persist – proving that identity signals are causing the false alarms.
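LEACE gives you a closed-form eraser, but the iterative INLP version is easy to sketch by hand: fit a probe for the concept, project its direction out of the activations, and repeat until the concept is no longer decodable. The activation dump and labels are placeholders:

```python
# Sketch of INLP-style erasure of the identity concept from layer activations.
import numpy as np
from sklearn.linear_model import LogisticRegression

def inlp(acts, labels, n_iters=10):
    X = acts.copy()
    for _ in range(n_iters):
        probe = LogisticRegression(max_iter=1000).fit(X, labels)
        if probe.score(X, labels) < 0.55:   # concept no longer linearly decodable
            break
        w = probe.coef_[0]
        w = w / np.linalg.norm(w)
        X = X - np.outer(X @ w, w)          # project out the probe direction
    return X

acts = np.random.randn(2000, 768)            # placeholder layer activations for the FP cohort
asn_labels = np.random.randint(0, 2, 2000)   # hypothetical Partner-X ASN labels
cleaned = inlp(acts, asn_labels)
# Re-run the classifier head on `cleaned` and compare FP/TP rates before vs. after.
```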

Activation Patching / Interchange Interventions

This technique swaps targeted hidden activations between runs (e.g., from a benign or correctly flagged example into an FP) to test whether a specific layer/head/position is causally responsible. An example would be activation patching hooks via TransformerLens/PyTorch.

Question: Does overwriting identity head activations in FPs collapse the C2 score (and does injecting timing activations matter less)?

Answer: When we disable the AI components focused on identity, the false alarm disappears; when we only provide timing information, it has minimal effect – proving that identity processing is the direct cause of false alarms.
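With plain PyTorch forward hooks the experiment is only a few lines: cache a layer’s output on a benign reference flow, then overwrite that same layer’s output while the model processes the false positive. The model handle, layer, and inputs are hypothetical, and the sketch assumes the hooked layer returns a plain tensor of the same shape in both runs:

```python
# Sketch of activation patching with plain PyTorch hooks.
import torch

def run_with_patch(model, layer, clean_inputs, corrupt_inputs):
    cache = {}

    def save_hook(module, inputs, output):
        cache["act"] = output.detach()

    def patch_hook(module, inputs, output):
        return cache["act"]                 # replace the FP run's activation

    # 1. Cache the activation from the benign (clean) run.
    h = layer.register_forward_hook(save_hook)
    with torch.no_grad():
        model(**clean_inputs)
    h.remove()

    # 2. Re-run the false positive with that activation patched in.
    h = layer.register_forward_hook(patch_hook)
    with torch.no_grad():
        patched_logits = model(**corrupt_inputs).logits
    h.remove()
    return patched_logits

# If the C2 logit collapses when we patch the layer hosting the "identity" heads
# but barely moves when we patch a timing-related layer, the identity path is causal.
```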

Sparse Autoencoders (SAEs) / Transcoders

This technique learns a sparse feature basis over activations so each example is explained by a few nameable features; transcoders are a related variant that learn a sparse map through an entire MLP layer (input to output), which makes them practical for broad coverage and live read/write. An example would be SAE training on residual streams with a transcoder readout.

Question: Which specific features fire inside the FP circuit, and what happens if we ablate or boost them?

Answer: The AI uses features like “Partner Network X” and “Cloud region Y” (irrelevant for security) versus “60-second periodicity” and “short burst beacon” (actual malicious behavior patterns). When we disable the irrelevant identity features, false alarms drop; when we enhance the behavioral features, detection of real threats improves.
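Training a small SAE on cached residual-stream activations is conceptually just an autoencoder with an L1 penalty; the dimensions, penalty weight, and activation dump below are placeholders:

```python
# Sketch of a small sparse autoencoder over cached residual-stream activations.
import torch

class SparseAutoencoder(torch.nn.Module):
    def __init__(self, d_model=768, d_features=8192):
        super().__init__()
        self.encoder = torch.nn.Linear(d_model, d_features)
        self.decoder = torch.nn.Linear(d_features, d_model)

    def forward(self, x):
        f = torch.relu(self.encoder(x))     # sparse feature activations
        return self.decoder(f), f

sae = SparseAutoencoder()
opt = torch.optim.Adam(sae.parameters(), lr=1e-4)
l1_coeff = 1e-3

acts = torch.randn(50_000, 768)             # placeholder for cached activations
for step in range(1000):
    batch = acts[torch.randint(0, len(acts), (256,))]
    recon, feats = sae(batch)
    loss = torch.nn.functional.mse_loss(recon, batch) + l1_coeff * feats.abs().mean()
    opt.zero_grad()
    loss.backward()
    opt.step()

# After training: inspect which features fire on the FP cohort, then ablate a
# feature by zeroing it before decoding and re-running the classifier head.
```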

Activation Steering (Activation Addition / Contrastive Vectors)

This technique adds a small steering vector at inference to up-weight desired features and down-weight shortcuts—reversible control without changing weights. An example would be contrastive steering vectors derived from SAE features.

Question: Can we bias decisions toward periodicity evidence and away from identity in production canaries?

Answer: We can adjust the AI to emphasize timing patterns while de-emphasizing network identity, which reduces false alarms on partner traffic while preserving detection of real threats.
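A contrastive steering vector can be as simple as the difference between mean activations on two hand-picked sets of flows, added back in at inference through a forward hook. Everything named here is a placeholder, and the strength term would need tuning on a canary set:

```python
# Sketch of contrastive activation steering via a forward hook.
import torch

# Mean layer activations over two hand-picked contrast sets (placeholders).
evidence_acts = torch.randn(200, 768)       # flows flagged for genuine periodicity
identity_acts = torch.randn(200, 768)       # benign Partner-X flows flagged by identity only
steering_vector = evidence_acts.mean(0) - identity_acts.mean(0)

alpha = 4.0                                  # steering strength, tuned on canaries

def steering_hook(module, inputs, output):
    # Add the steering vector to every position's residual stream.
    return output + alpha * steering_vector

# handle = model.layers[12].register_forward_hook(steering_hook)   # hypothetical layer
# ... run production canary traffic, compare FP/TP rates ...
# handle.remove()                                                  # fully reversible
```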

Knowledge Editing (ROME, MEMIT, etc.)

This technique performs localized weight edits to update or remove a specific association (e.g., string → class) while minimizing collateral effects. An example would be ROME/MEMIT edits applied to the relevant MLP layer.

Question: Can we neutralize the association “JA3 1234 @ Partner X ⇒ C2” without harming detections that rely on periodicity/payload evidence?

Answer: After the edit, benign flows with JA3 1234 from the partner are no longer flagged, while real beacons that exhibit periodic timing continue to trigger.
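For illustration only (see my caveat at the end of this post), the heart of ROME-style editing is a rank-one update to one weight matrix so that a specific key maps to a new value. The toy version below skips the covariance statistics the real method uses, and every tensor is a placeholder:

```python
# Toy rank-one edit in the spirit of ROME, without the covariance weighting the
# real method applies. W stands in for an MLP projection; k for the activation
# pattern of "JA3 1234 @ Partner X"; v_new for the value we want returned instead.
import torch

def rank_one_edit(W, k, v_new):
    """Return W' such that W' @ k == v_new, leaving directions orthogonal to k untouched."""
    delta = v_new - W @ k
    return W + torch.outer(delta, k) / torch.dot(k, k)

W = torch.randn(768, 3072)
k = torch.randn(3072)                 # key: activation for the offending association
v_new = torch.randn(768)              # value that no longer maps to "C2"

W_edited = rank_one_edit(W, k, v_new)
print(torch.allclose(W_edited @ k, v_new, atol=1e-3))   # True
```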

Advanced Methods (Multi-GPU Compute)

Path Patching / Attribution Patching

This technique extends patching from points to full computation paths (residual → head → MLP), revealing which routes actually carry the decision signal. An example would be attribution patching code built on top of TransformerLens.

Question: Which end-to-end paths transmit the FP C2 signal?

Answer: The AI pathway that processes identity information carries most of the signal leading to false alarms, while the pathway that processes timing patterns contributes little to these mistakes.
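Attribution patching replaces thousands of patching runs with one clean run, one corrupted run, and one backward pass, scoring every component with a first-order approximation: (clean activation minus corrupted activation) dotted with the gradient of the metric. A rough sketch, with the model handle and module list as hypothetical stand-ins and matching shapes assumed between runs:

```python
# Sketch of attribution patching over a list of (name, module) pairs.
import torch

def attribution_patching(model, modules, clean_inputs, corrupt_inputs, metric):
    clean_cache, corrupt_cache = {}, {}

    def make_hook(cache, name):
        def hook(module, inputs, output):
            output.retain_grad()            # keep gradients on the intermediate tensor
            cache[name] = output
        return hook

    # Clean run (activations only).
    handles = [m.register_forward_hook(make_hook(clean_cache, n)) for n, m in modules]
    model(**clean_inputs)
    for h in handles: h.remove()

    # Corrupted (false positive) run with a backward pass through the metric.
    handles = [m.register_forward_hook(make_hook(corrupt_cache, n)) for n, m in modules]
    out = model(**corrupt_inputs)
    metric(out).backward()
    for h in handles: h.remove()

    # First-order estimate of "what if we patched this activation to its clean value?"
    return {name: ((clean_cache[name] - corrupt_cache[name]).detach()
                   * corrupt_cache[name].grad).sum().item()
            for name in corrupt_cache}

# scores = attribution_patching(model, named_layers, benign_inputs, fp_inputs,
#                               lambda out: out.logits[0, 1] - out.logits[0, 0])
```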

Causal Scrubbing

This technique falsifies or confirms a circuit hypothesis by holding the hypothesized relevant features fixed while resampling everything else (and vice versa) to see if behavior tracks the claim. An example would be a causal scrubbing harness over controlled resampling datasets.

Question: Do FPs persist when only identity is held fixed, and vanish when identity is resampled even if timing stays beacon-like?

Answer: When we keep only identity information constant and randomize everything else, false alarms persist; when we keep timing patterns constant and randomize identity information, false alarms disappear; confirming that identity drives the false alarms.
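Real causal scrubbing resamples internal activations according to the hypothesis; the toy version below only resamples input features, but it shows the shape of the test. The feature split and the predict function are hypothetical stand-ins for the trained model:

```python
# Heavily simplified, input-level version of the causal-scrubbing logic:
# hold the features the hypothesis says matter, resample the rest, and check
# whether the false-positive rate follows the hypothesis.
import numpy as np

rng = np.random.default_rng(0)
n, id_cols, timing_cols = 1000, slice(0, 8), slice(8, 16)
fp_flows = rng.normal(size=(n, 16))          # placeholder for benign Partner-X flows being flagged
other_flows = rng.normal(size=(n, 16))       # pool to resample from

def predict(x):                              # stand-in for the trained classifier
    return (x[:, id_cols].sum(axis=1) > 0).astype(int)   # 1 = "C2"

def scrub(resample_cols):
    scrubbed = fp_flows.copy()
    scrubbed[:, resample_cols] = other_flows[rng.permutation(n)][:, resample_cols]
    return predict(scrubbed).mean()          # fraction still flagged as C2

print("identity fixed, timing resampled :", scrub(timing_cols))
print("timing fixed, identity resampled :", scrub(id_cols))
```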

ACDC (Automated Circuit Discovery)

This technique automatically prunes the computation graph to the smallest subgraph that still reproduces the target behavior (e.g., the FP logit gap), giving a minimal circuit. An example would be ACDC-style pruning pipelines.

Question: What minimal circuit recreates the FP behavior?

Answer: A minimal AI pathway focused on identity processing can recreate most of the false alarm behavior on test cases—showing exactly which parts of the AI are responsible for the problem.

Linear Parameter Decomposition (LPD)

This technique factors weight matrices into a small number of interpretable components (atoms), enabling git diff-like audits across checkpoints and scaling/removal of problematic components. An example would be an LPD pipeline over attention/MLP weight tensors.

Question: Is there a weight component that globally couples identity features to C2, and can we dial it down?

Answer: One component’s magnitude tracks FP rate across versions; scaling it down reduces FPs without harming periodicity-driven detections.

Industrial-Scale Methods (Cluster-Level Compute)

Training Time Instrumentation & Interpretability-Aware Training

This technique builds probes/SAEs/transcoders into the training loop to log feature firing, gate releases on white-box checks, and encourage sparse, modular internals that remain editable. An example would be fine-tuning with SAE/transcoder telemetry and sparsity regularizers.

Question: How do we prevent identity shortcuts from re-emerging in future fine-tunes?

Answer: We monitor identity vs. periodicity feature rates during training and block model promotion when shortcut features rise, keeping the beacon circuit evidence-driven.
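One way to sketch such a gate, with all feature indices, thresholds, and the feature matrix (e.g. SAE feature activations dumped during evaluation) as placeholders, is to track how often the known identity features fire relative to the periodicity features on a validation batch and refuse to promote the checkpoint when the ratio drifts:

```python
# Sketch of a training-time shortcut gate over SAE feature firing rates.
import torch

IDENTITY_FEATURES = [4121, 978, 2210]       # hypothetical "Partner X" / ASN features
PERIODICITY_FEATURES = [301, 5507]          # hypothetical beaconing-behavior features
MAX_RATIO = 1.5                             # policy threshold set on earlier releases

def shortcut_gate(feature_acts):
    """feature_acts: (batch, d_features) SAE feature activations on validation flows."""
    identity_rate = (feature_acts[:, IDENTITY_FEATURES] > 0).float().mean()
    periodicity_rate = (feature_acts[:, PERIODICITY_FEATURES] > 0).float().mean()
    ratio = (identity_rate / (periodicity_rate + 1e-6)).item()
    return ratio <= MAX_RATIO, ratio

# Every eval interval during fine-tuning:
feature_acts = torch.relu(torch.randn(512, 8192))   # placeholder for real SAE features
ok, ratio = shortcut_gate(feature_acts)
if not ok:
    print(f"blocking promotion: identity/periodicity firing ratio {ratio:.2f}")
```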

From Mystery to Mechanism

Let’s be clear: you wouldn’t use all these techniques to debug one simple false positive. For quick triage of a problem like this, you’d pick maybe three methods at most – attention pattern viewing to see what the AI focuses on, saliency methods to identify which inputs matter most, and linear probes to check where concepts are encoded. These laptop-level techniques give you enough insight to understand and fix the immediate problem without burning through GPU budgets.

The broader point is that we now have more and more techniques to peer inside AI models, but understanding these systems as a whole remains far beyond our reach. What I’ve shown here is just one example of identifying one specific behavior at one particular moment. Even one minute of network traffic data might contain so many different decision pathways and interactions that it would take years to decode them all.

This is why the existential risk concerns from my earlier posts aren’t theoretical. We’re building systems whose internal reasoning we can only sample and test in fragments. Each technique in this toolkit gives us a small window into AI decision-making, but the complete picture stays hidden.

One final note: you’ll notice I included knowledge editing techniques like ROME and MEMIT in the toolkit, but I wouldn’t recommend using them in practice. That field of study should really be called “knowledge suppression” rather than unlearning or knowledge editing. It often creates more problems than it solves by introducing inconsistencies and unexpected behaviors elsewhere in the model. But that’s a topic for another day.