ICML18 Paper
Been Kim et al.
Interpretability Beyond Feature Attribution: Quantitative Testing with Concept Activation Vectors (TCAV)
Presenter: Nguyen Tuan Duong
Keywords: neural network interpretability, concept attribution, concept activation vector, hypothesis testing
Disclaimer: Image illustrations in this presentation are copied from the authors' ICML slides and the paper.
WHY?
Goal: quantify how much a concept (e.g., gender) was important for a model's prediction, even if the concept was not part of the training data.
Consider a feedforward neural network classifying input images $x \in \mathbf{R}^n$ into $K (\geq 2)$ classes, and a layer $l$ with $m$ neurons. Let $f_l: \mathbf{R}^n \rightarrow \mathbf{R}^m$ map an input to its layer-$l$ activations, and let $h_{l,k}: \mathbf{R}^m \rightarrow \mathbf{R}$ map layer-$l$ activations to the logit of class $k$.
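A minimal sketch of one way to obtain $f_l$ and $h_{l,k}$ by splitting a model, assuming a PyTorch `nn.Sequential` classifier; the architecture, layer sizes, and split index are illustrative choices, not from the paper.

```python
import torch.nn as nn

# Hypothetical feedforward classifier; the architecture is illustrative only.
model = nn.Sequential(
    nn.Flatten(),
    nn.Linear(784, 256), nn.ReLU(),   # everything up to layer l (m = 256 neurons)
    nn.Linear(256, 10),               # remaining layers producing K = 10 logits
)

f_l = nn.Sequential(*list(model.children())[:3])  # x -> f_l(x) in R^m
h_l = nn.Sequential(*list(model.children())[3:])  # f_l(x) -> logits; h_{l,k} is component k
```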
For a user-defined concept $C$, collect a positive set of images representing the concept, $P_C$, and a negative set of images without the concept, $N_C$. Compute their layer-$l$ activations $A_P = \{f_l(x) : x \in P_C\}$ and $A_N = \{f_l(x) : x \in N_C\}$.
Train a binary linear classifier to distinguish $A_P$ from $A_N$. The (linear) decision boundary is a hyperplane with normal vector $v^l_C \in \mathbf{R}^m$. This $v^l_C$ is called the concept activation vector (CAV) and serves as the representation of the concept $C$ in the layer-$l$ activation space.
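A minimal sketch of fitting a CAV, assuming the layer-$l$ activations have already been collected into numpy arrays; the helper name `fit_cav` and the choice of logistic regression are illustrative, not the paper's exact setup.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def fit_cav(acts_pos, acts_neg):
    """Fit a linear classifier separating concept activations from negatives.

    acts_pos: (n_pos, m) layer-l activations f_l(x) for x in P_C
    acts_neg: (n_neg, m) layer-l activations f_l(x) for x in N_C
    Returns the unit normal vector of the decision hyperplane, i.e., the CAV v_C^l.
    """
    X = np.concatenate([acts_pos, acts_neg], axis=0)
    y = np.concatenate([np.ones(len(acts_pos)), np.zeros(len(acts_neg))])
    clf = LogisticRegression(max_iter=1000).fit(X, y)
    v = clf.coef_.ravel()            # normal vector of the separating hyperplane
    return v / np.linalg.norm(v)     # unit CAV
```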
To measure the sensitivity of the output logit w.r.t. a conceptual change, i.e., a perturbation of the layer-$l$ activations in the direction of the CAV, use the directional derivative, $$S_{C,k,l}(x) = \underset{\epsilon \rightarrow 0}{\lim} \frac{h_{l,k}(f_l(x) + \epsilon v_C^l) - h_{l,k}(f_l(x))}{\epsilon} = \nabla h_{l,k}(f_l(x)) \cdot v_C^l \in \mathbf{R}$$ where $v_C^l$ is the unit CAV of the concept $C$ at layer $l$. Intuitively, $S_{C,k,l}(x)$ indicates how the output logit changes when the activation $f_l(x)$ moves in the direction of $v_C^l$.
$S_{C,k,l}(x) > 0$ means the concept is positively correlated with the output logit, i.e., a perturbation in the direction of $v_C^l$ moves away from the decision boundary and increases the probability of class $k$.
c.f. saliency maps, where the sensitivity of the class-$k$ output logit w.r.t. the intensity of pixel $(i,j)$ is measured per pixel: $$S_{i,j}(x) = \frac{\partial h_k(x)}{\partial x_{i,j}}$$
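A sketch of computing the per-input conceptual sensitivity with autograd, assuming the hypothetical `f_l`/`h_l` split shown earlier, a batch of inputs `x`, a unit CAV `cav` of length $m$, and a class index `k`.

```python
import torch

def conceptual_sensitivity(f_l, h_l, x, cav, k):
    """Directional derivative S_{C,k,l}(x) = grad h_{l,k}(f_l(x)) . v_C^l."""
    a = f_l(x).detach().requires_grad_(True)    # layer-l activations, treated as the variable
    logit_k = h_l(a)[:, k].sum()                # class-k logit, summed over the batch
    logit_k.backward()                          # populates a.grad with d(logit_k)/da
    grad = a.grad.reshape(a.shape[0], -1)       # (batch, m) gradient w.r.t. activations
    v = torch.as_tensor(cav, dtype=grad.dtype)  # unit CAV, shape (m,)
    return grad @ v                             # one sensitivity value per input
```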
Aggregate the per-input conceptual sensitivities over all inputs of a class $k$ to measure the class-level conceptual sensitivity:
$$TCAV_{C,k,l} = \frac{|\{ x \in \mathcal{X}_k: S_{C,k,l}(x) > 0\}|}{|\mathcal{X}_k|}$$ where $\mathcal{X}_k$ denotes all inputs belonging to class $k$.
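A tiny sketch of this aggregation, assuming `sensitivities` holds $S_{C,k,l}(x)$ for every $x \in \mathcal{X}_k$ (e.g., from the hypothetical `conceptual_sensitivity` helper above).

```python
import numpy as np

def tcav_score(sensitivities):
    """TCAV_{C,k,l}: fraction of class-k inputs with positive conceptual sensitivity."""
    s = np.asarray(sensitivities)
    return float((s > 0).mean())
```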
TCAV is sensitive to low-quality, spurious CAVs. To guard against them, the paper uses statistical significance testing: train many CAVs for the same concept with different random negative sets, compute a TCAV score for each, and run a two-sided t-test against the TCAV scores obtained from random (meaningless) CAVs; only concepts whose scores differ significantly are kept.
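A minimal sketch of such a significance test, assuming `concept_scores` are TCAV scores from CAVs trained with different random negative sets and `random_scores` are TCAV scores from CAVs trained on random "concepts"; the 0.05 threshold is an illustrative choice.

```python
from scipy.stats import ttest_ind

def concept_is_significant(concept_scores, random_scores, alpha=0.05):
    """Two-sided t-test: do concept TCAV scores differ from those of random CAVs?"""
    _, p_value = ttest_ind(concept_scores, random_scores)
    return p_value < alpha
```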
Q: For semantically related concepts (e.g., brown hair vs. black hair), how to compare their attributions?
For two concepts $C$ and $D$: train a linear classifier to separate the layer-$l$ activations of $P_C$ from those of $P_D$. The normal vector of the resulting hyperplane is a relative CAV, and the corresponding (relative) TCAV scores compare the two concepts' attributions directly.
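A short sketch under the same assumptions as the hypothetical `fit_cav` helper above: a relative CAV comes from fitting the same linear classifier on the two concepts' positive activations.

```python
# Relative CAV: separate layer-l activations of concept C from those of concept D.
v_rel = fit_cav(acts_C, acts_D)  # acts_C, acts_D: f_l activations for P_C and P_D
```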
TCAV (and CAV) is a useful quantity for interpreting conceptual attribution.
Caveat: correlation doesn't imply causation, but it doesn't exclude causation either.