2018 Summer Paper Fest

ICML18 Paper

Been Kim et al., Interpretability Beyond Feature Attribution: Quantitative Testing with Concept Activation Vectors (TCAV)

Presenter: Nguyen Tuan Duong

Keywords: neural network interpretability, concept attribution, concept activation vector, hypothesis testing

Disclaimer: The image illustrations in this presentation are copied from the authors' ICML slides and the paper.

Model Interpretability

WHY?

  • Quantitative explanation of the decisions made by an ML model
  • Do those decisions align with / reflect human knowledge and values?

Problem

Existing approaches

  • Linear models
    • easy to interpret, as the explanation is built into the model
    • but often too simple; lower performance than deep models
  • Saliency map methods, e.g., SmoothGrad, Integrated Gradients
    • produce heat maps measuring the contribution of each pixel to the model output; useful for introspecting pixel-based attribution, but offer no control over concepts
    • local explanation: a per-input (e.g., per-image) explanation
    • potentially vulnerable to adversarial attacks
  • LIME
    • Local explanation
  • Influence functions
    • TODO(duong): Check this paper

Proposed method

Goal: Quantify how important a concept (e.g., gender) was for a model's prediction, even if the concept was not part of the training data.

(Figures: motivating examples of a "doctor" classifier and a "zebra" classifier)

Overview

(Figure: TCAV overview)

Diving into Details

Consider a feedforward neural network classifying input images $x \in \mathbf{R}^n$ into $K (\geq 2)$ classes, and a layer $l$ with $m$ neurons (see the sketch after the list below):

  • $f_l(x)$: takes an input image $x$ and outputs the layer-$l$ activations $a \in \mathbf{R}^m$
  • $h_{l,k}(a)$: takes the layer-$l$ activations $a$ and outputs the class-$k$ logit $\in \mathbf{R}$
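
As a concrete illustration, here is a minimal sketch of this decomposition, assuming a TensorFlow/Keras functional model whose final output is the vector of class scores; the helper names and `layer_name` are illustrative, not the authors' API:

```python
import tensorflow as tf

def get_f_l(model: tf.keras.Model, layer_name: str) -> tf.keras.Model:
    """f_l: maps an input image x to its layer-l activations."""
    return tf.keras.Model(inputs=model.input,
                          outputs=model.get_layer(layer_name).output)

def get_acts_and_logits_model(model: tf.keras.Model, layer_name: str) -> tf.keras.Model:
    """One model returning both the layer-l activations and the class outputs,
    so h_{l,k}(f_l(x)) is the k-th entry of the second output; this form makes
    it easy to differentiate the class-k logit w.r.t. the activations later."""
    return tf.keras.Model(inputs=model.input,
                          outputs=[model.get_layer(layer_name).output, model.output])
```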

CAV

For a user-defined concept $C$, collect a positive set $P_C$ of images representing the concept and a negative set $N_C$ of images without the concept:

  • $A_P = \{~f_l(x)~|~ x \in P_C \}$
  • $A_N = \{~f_l(x)~|~ x \in N_C \}$

Train a binary linear classifier to distinguish between $A_P$ and $A_N$. The (linear) decision boundary is a hyperplane with normal vector $v^l_C \in \mathbf{R}^m$. This $v^l_C$ is called the concept activation vector (CAV) and serves as the representation of the concept $C$ in the layer-$l$ activation space.
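
A minimal sketch of CAV training under these definitions, assuming the activations have already been computed and flattened into shape $(n, m)$; the choice of scikit-learn logistic regression and the helper name `train_cav` are illustrative (the paper prescribes a linear classifier, not this exact one):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def train_cav(acts_pos: np.ndarray, acts_neg: np.ndarray) -> np.ndarray:
    """Train a binary linear classifier on layer-l activations and return the unit CAV.

    acts_pos: activations f_l(x) for x in P_C, shape (n_pos, m)
    acts_neg: activations f_l(x) for x in N_C, shape (n_neg, m)
    """
    X = np.concatenate([acts_pos, acts_neg], axis=0)
    y = np.concatenate([np.ones(len(acts_pos)), np.zeros(len(acts_neg))])
    clf = LogisticRegression(max_iter=1000).fit(X, y)
    v = clf.coef_.ravel()          # normal vector of the decision hyperplane
    return v / np.linalg.norm(v)   # unit CAV, pointing towards the concept (positive) side
```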

Discussion

  • Why a linear interpretation, i.e., why use a linear classifier?
  • What if $A_P$ and $A_N$ are not linearly separable?
  • The CAV depends on the choice of $P_C$ and $N_C$; should a CAV only be accepted when the linear classifier achieves high accuracy?
  • In practice, we can train a multi-class linear classifier for multiple concepts at once, e.g., one-versus-rest logistic regression.

Conceptual Sensitivity

To measure the sensitivity of the output logit w.r.t. a change in the direction of the concept, i.e., along the CAV, use the directional derivative, $$S_{C,k,l}(x) = \underset{\epsilon \rightarrow 0}{\lim} \frac{h_{l,k}(f_l(x) + \epsilon v_C^l) - h_{l,k}(f_l(x))}{\epsilon} = \nabla h_{l,k}(f_l(x)) \cdot v_C^l \in \mathbf{R}$$ where $v_C^l$ is the unit CAV of the concept $C$ at layer $l$. Intuitively, $S_{C,k,l}(x)$ indicates how the output logit changes as the activation $f_l(x)$ moves in the direction of $v_C^l$.

$S_{C,k,l}(x) > 0$ means the concept is positively correlated with the output logit: a perturbation of the activations in the direction of $v_C^l$ moves away from the decision boundary and increases the class-$k$ logit, and hence the probability of class $k$.

cf. In saliency maps, the sensitivity of the class-$k$ output logit w.r.t. the intensity of pixel $(i,j)$ is measured per pixel: $$S_{i,j} = \frac{\partial h_k(x)}{\partial x_{i,j}}$$
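
In code, the directional derivative is just the gradient of the class-$k$ logit w.r.t. the layer-$l$ activations, dotted with the unit CAV. A minimal TensorFlow sketch, assuming the two-output model from the earlier sketch and that the model's final output is the logit vector:

```python
import numpy as np
import tensorflow as tf

def conceptual_sensitivity(acts_and_logits_model: tf.keras.Model,
                           x: tf.Tensor, cav: np.ndarray, k: int) -> np.ndarray:
    """S_{C,k,l}(x) = grad_a h_{l,k}(a) . v_C^l, evaluated at a = f_l(x).

    acts_and_logits_model: maps x -> (layer-l activations, class logits)
    x: a batch of inputs; cav: unit CAV of shape (m,); k: class index.
    """
    with tf.GradientTape() as tape:
        acts, logits = acts_and_logits_model(x)   # acts recorded by the tape
        class_logit = logits[:, k]
    grads = tape.gradient(class_logit, acts)             # d h_{l,k} / d a
    grads = tf.reshape(grads, (tf.shape(grads)[0], -1))  # flatten to (batch, m)
    v = tf.constant(cav, dtype=grads.dtype)
    return tf.linalg.matvec(grads, v).numpy()            # one S value per input
```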

Testing with CAV (TCAV)


Technical details

Aggregate the per-input conceptual sensitivities over all inputs of class $k$ to measure the class-level conceptual sensitivity.

$$TCAV_{C,k,l} = \frac{|\{ x \in \mathcal{X}_k: S_{C,k,l}(x) > 0\}|}{|\mathcal{X}_k|}$$ where $\mathcal{X}_k$ denotes all inputs belonging to class $k$.
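
A minimal sketch of the score itself, which only needs the per-input sensitivities computed earlier (names are illustrative):

```python
import numpy as np

def tcav_score(sensitivities: np.ndarray) -> float:
    """sensitivities: S_{C,k,l}(x) for every x in X_k (one value per input).
    Returns the fraction of inputs with a positive conceptual sensitivity."""
    return float(np.mean(np.asarray(sensitivities) > 0))

# e.g., with the helper from the previous sketch:
# score = tcav_score(conceptual_sensitivity(acts_and_logits_model, x_class_k, cav, k))
```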

Statistical significance testing

TCAV is sensitive to low-quality or random CAVs. To guard against spurious CAVs:

  • compute the TCAV score multiple ($T$) times using different negative sets $N_C$, obtaining $\{TCAV_{C,k,l}^{(i)}\}_{i=1}^T$
  • use $\{TCAV_{C,k,l}^{(i)}\}_{i=1}^T$ to perform a two-sided t-test of whether the TCAV score is statistically different from the chance level 0.5 (see the sketch after this list):
    • $H_0: TCAV_{C,k,l} = 0.5$ (random, not useful)
    • $H_1: TCAV_{C,k,l} \neq 0.5$
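
A minimal sketch of this test using SciPy's one-sample two-sided t-test against the chance level 0.5; the significance level is an illustrative choice:

```python
import numpy as np
from scipy.stats import ttest_1samp

def is_significant(tcav_scores, alpha: float = 0.05) -> bool:
    """tcav_scores: T TCAV scores for the same (concept, class, layer),
    each computed against a different negative set N_C.
    Two-sided one-sample t-test of H0: mean TCAV score = 0.5."""
    _, p_value = ttest_1samp(np.asarray(tcav_scores), popmean=0.5)
    return bool(p_value < alpha)   # True: reject H0, the score is non-random
```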

Discussion

  • $TCAV_{C,k,l}$ only aggregates the signs of the conceptual sensitivities. Could a different statistic, e.g., the magnitude, be used?

Relative TCAV

Q: For semantically related concepts (e.g., brown hair vs. black hair), how can we compare their attributions?

For two concepts $C$ and $D$:

  • collect two sets of inputs representing the two concepts, $P_C$ and $P_D$
  • train a linear classifier to separate $f_l(P_C)$ from $f_l(P_D)$; the normal vector of its decision boundary is the relative CAV $v_{C,D}^l$
  • for an input $x$, the projection $f_l(x) \cdot v_{C,D}^l$ measures whether $x$ is more relevant to $C$ or to $D$ (see the sketch below)
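
A minimal sketch of the relative CAV, reusing the same linear-classifier idea with the two concept sets playing the positive/negative roles (names and classifier choice are illustrative):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def relative_cav(acts_c: np.ndarray, acts_d: np.ndarray) -> np.ndarray:
    """Train a C-vs-D linear classifier on layer-l activations and return
    the unit relative CAV v_{C,D}^l, pointing towards concept C."""
    X = np.concatenate([acts_c, acts_d], axis=0)
    y = np.concatenate([np.ones(len(acts_c)), np.zeros(len(acts_d))])
    v = LogisticRegression(max_iter=1000).fit(X, y).coef_.ravel()
    return v / np.linalg.norm(v)

# For an input x with flattened activations a = f_l(x), a positive projection
# a @ relative_cav(...) suggests x is more relevant to C than to D.
```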

Experimental Results

Below are some cherry-picked results. Please see the paper for more complete and detailed experimental results.

GoogleNet

(Figure: TCAV results on GoogleNet)

Inception V3

(Figure: TCAV results on Inception V3)

Where is the concept learned?

(Figure: TCAV scores across layers, showing where the concept is learned)

Medical Application

(Figure: Medical application to Diabetic Retinopathy)

Recap + Next Steps

  • TCAV (and CAV) is a useful quantity for interpreting concept attribution

    • high-level concepts (e.g., gender) vs. numerical features (e.g., pixels, activations)
    • works for any user-defined concept, not necessarily one considered during model training
    • applied post-training, without retraining or modifying the original model
    • global explanation: explains sets of examples or an entire class, not just a single data point as in local explanation methods
  • Caveat: correlation doesn't imply causation, but it doesn't exclude causation either

  • Can be a useful technique for debugging models (e.g., detecting biases)
  • Code: https://github.com/tensorflow/tcav